FOREWORD




The Software Engineering Laboratory (SEL) is an organization
sponsored by the National Aeronautics and Space 
Administration/Goddard Space Flight Center (NASA/GSFC) and 
created for the purpose of investigating the effectiveness 
of software engineering technologies when applied to the 
development of applications software. The SEL was created 
in 1977 and has three primary organizational members: 

NASA/GSFC (Systems Development and Analysis Branch) 

The University of Maryland (Computer Sciences Department)

Computer Sciences Corporation (Flight Systems Operation)

The goals of the SEL are (1) to understand the software de- 
velopment process in the GSFC environment; (2) to measure 
the effect of various methodologies, tools, and models on 
this process; and (3) to identify and then to apply success- 
ful development practices. The activities, findings, and 
recommendations of the SEL are recorded in the Software En- 
gineering Laboratory Series, a continuing series of reports 
that includes this document. The papers contained in this 
document appeared previously as indicated in each section. 

Single copies of this document can be obtained by writing to 

Frank E. McGarry 
Code 582 
NASA/GSFC 

Greenbelt, Maryland 20771 
SECTION 1 - INTRODUCTION


This document is a collection of technical papers produced 
by participants in the Software Engineering Laboratory (SEL) 
during the period January 1, 1982, through November 30, 1983. 
The purpose of the document is to make available, in one ref- 
erence, some results of SEL research that originally appeared 
in a number of different forums. This is the second such 
volume of technical papers produced by the SEL. Although 
these papers cover several topics related to software engi- 
neering, they do not encompass the entire scope of SEL activ- 
ities and interests. Additional information about the SEL 
and its research efforts may be obtained from the sources 
listed in the bibliography at the end of this document. 

For the convenience of this presentation, the nine papers 
contained here are grouped into four major categories:

• The Software Engineering Laboratory

• Resource Models 

• Software Measures 

• Data Collection 

The first category presents summaries of the SEL organiza- 
tion, operation, and research activities. The second and 
third categories include papers describing the results of 
specific research projects in the areas of resource models 
and software measures, respectively. The last category 
presents papers describing strategies for data collection 
for software engineering research. 

The SEL is actively working to increase its understanding and 
to improve the software development process at Goddard Space 
Flight Center. Future efforts will be documented in addi- 
tional volumes of the Collected Software Engineering Papers 
and other SEL publications. 
SECTION 2 - THE SOFTWARE ENGINEERING LABORATORY


The technical papers included in this section were origi- 
nally prepared as indicated below. 

• Agresti, W. W., F. E. McGarry, D. N. Card, et al., 
"Measuring Software Technology," Computer Sciences 
Corporation, Technical Memorandum, November 1983 
(reprinted by permission of the authors) 


A version of this paper will appear in Program
Transformation and Programmer Environments.
New York: Springer-Verlag, 1984.


• Basili, V. R., "Technical Summary - 1982: Report
to the National Aeronautics and Space Administra-
tion," University of Maryland, Technical Memoran-
dum, December 1982 (reprinted by permission of the
author)
ABSTRACT

Results are reported from a series of investigations into the effec- 
tiveness of various methods and tools used in a software production 
environment. The basis for the analysis is a project data base, 
built through extensive data collection and process instrumentation. 
The project profiles become an organizational memory, serving as a 
reference point for an active program of measurement and experimenta- 
tion on software technology. 

INTRODUCTION 

Many proposals aimed at improving the software development process 
have emerged during the past several years. Such approaches as 
structured design, automated development tools, software metrics, 
resource estimation models, and special management techniques have 
been directed at building, maintaining, and estimating the software 
process and product. 

Although the software development community has been presented with 
these new tools and methods, it is not clear which of them will prove 
effective in particular environments. When this question is ap- 
proached from the user's perspective, the issue is to associate with 
each programming environment a set of enabling conditions and "win" 
predicates to signal when methods can be applied and which ones will 
improve performance. Lacking such guidelines, organizations are left 
to introduce new procedures with little understanding of their likely 
effect. 

Assessing methods and tools for potential application is a central 
activity of the Software Engineering Laboratory (SEL) [1, 2]. The 
SEL was established in 1977 by the National Aeronautics and Space 
Administration (NASA)/Goddard Space Flight Center (GSFC) in conjunc-
tion with Computer Sciences Corporation and the University of
Maryland. The SEL's approach is to understand and measure the soft- 
ware development process, measure the effects of new methods through 
experimentation, and apply those methods and tools that offer im- 
provement. The environment of interest supports flight dynamics ap- 
plications at NASA/GSFC. This scientific software consists primarily 
of FORTRAN, with some assembler code, and involves interactive 
graphics. The average size of a project is 60,000 to 70,000 source 
lines of code. 

SEL investigations demonstrate the advantages of building and main- 
taining an organizational memory on which to base a program of ex- 
perimentation and evaluation. Over 40 projects, involving 
1.8 million source lines of code, have been monitored since 1977. 
Project data have been collected from five sources: 

• Activity and change forms completed by programmers and man- 
agers 

• Automated computer accounting information 

• Automated tools such as code analyzers 

• Subjective evaluations by managers 

• Personal interviews 

The resulting data base contains over 25 megabytes of profile infor- 
mation on completed projects. 

Some highlights of SEL investigations using the project history data
base are presented here, organized into three sections: 

• Programmer Productivity 

• Cost Models 

• Technology Evaluations 

PROGRAMMER PRODUCTIVITY 

The least understood element of the software development process is 
the behavior of the programmer. One SEL study examined the distri- 
bution of programmer time spent on various activities. When specific 
dates were used to mark the end of one phase and the beginning of the 
next, 22 percent of the total hours were attributed to the design 
phase, with 48 percent for coding, and 30 percent for testing. How-
ever, if the programmers' completed forms were used to identify ac-
tual time spent on various activities, the breakdown was
approximately equal for the four categories of designing, coding,
testing, and "other" (activities such as travel, training, and
unknown) [3]. Although an attractive target for raising productivity
was to eliminate the "other" category, the SEL found that this was
not easily done.

Regarding individual programmer productivity, the SEL found differ- 
ences as great as 10 to 1, where productivity was measured in lines 
of code per unit of effort [4]. This result was consistent with 
similar studies in other organizations [5].

COST MODELS 

Cost is often expressed in terms of the effort required to develop 
software. In the effort equation, 

E = aI^b

where E equals effort in staff time and I equals size in lines of 
code, some studies reported a value of b greater than one, indicating 
that effort must be increased at a higher rate than the increase in 
system size. The SEL analysis of projects in its data base did not 
support this result, finding instead a nearly linear relationship 
between effort and size [6]. This conclusion may be due to the SEL
projects being smaller than those that would require more than a 
linear increase in effort. 
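
To make the effort equation concrete, the following is a minimal sketch, assuming the constants a and b are estimated by ordinary least squares on log-transformed data. That is a common way to fit a power law, but it is not stated to be the procedure used in [6]. The function name and the sample data are hypothetical.

```python
import math

def fit_power_law(sizes, efforts):
    """Fit E = a * I**b by least squares on (log I, log E)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in efforts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope in log-log space is the exponent
    a = math.exp(mean_y - b * mean_x)    # intercept gives the multiplier
    return a, b

# Hypothetical data generated from E = 0.002 * I, a perfectly linear case.
sizes = [20_000, 40_000, 60_000, 80_000]
efforts = [0.002 * s for s in sizes]
a, b = fit_power_law(sizes, efforts)
```

A fitted b close to one corresponds to the nearly linear relationship described above; a value noticeably above one would indicate effort growing faster than size.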

In a separate study, the SEL used cost data from projects to evaluate 
the performance of various resource estimation models. One study, 
using a subset of completed projects, compared the predictive ability 
of five models: Doty, SEL, PRICE S, Tecolote, and COCOMO [7]. The
objective was to determine which model best characterized the SEL
environment. The results showed that some models worked well on some 
projects, but no model emerged as a single source on which to base a 
program of estimation [8]. In the SEL environment, cost models have 
value as a supplementary tool to flag extreme cases and to reinforce 
the estimates of experienced managers. 

TECHNOLOGY EVALUATIONS 

Several SEL experiments have been conducted to assess the effective- 
ness of different process technologies. One study focused on the use 
of an independent verification and validation (IV&V) team. The 
premise for introducing an IV&V team into the software development
process is that any added cost will be offset by the early discovery 
of errors. The expected benefit is a software product of greater 
quality and reliability. In experimenting with an IV&V team in the 
SEL environment, the benefits were not completely realized [9]. The
record on early error detection was better with IV&V than without it, 
but the reliability of the final product was not improved. Also, the 
productivity of the development team was comparatively low, due in 
part to the necessary interaction with the IV&V team. The conclusion 
was that an IV&V team was not effective in the SEL environment, but 
may be effective where there are larger projects or higher reli- 
ability requirements. 

A recent SEL investigation measured the effect of seven specific 
techniques on productivity and reliability. From the project data 
base, indices were developed to capture the degree of use of quality 
assurance procedures, development tools, documentation, structured 
code, top-down development, code reading, and chief programmer team 
organization. The results showed that the greatest productivity and 
reliability improvements due to methodology use lie only in the range 
of 15 to 30 percent. Significant factors within this range are the 
positive effect of structured code on productivity and the positive 
effects of quality assurance, documentation, and code reading on re-
liability [10].

Figure 1 summarizes the perceived effectiveness of various practices 
in the SEL environment [4]. The placement of the models and
methods is based on the overhead cost of applying the model or method 
and the benefit of its use. This summary must be interpreted in the 
following context: 

• The placement reflects subjective evaluations as well as 
experimental results. 

• The chart is indicative of experiences in the SEL environ- 
ment only. 

• The dynamic nature of the situation is not apparent. The 
evaluation may reflect on an earlier and less effective ex- 
ample of the model or method. 
[Figure 1 is not reproduced here; the recoverable axis label is OVERHEAD COST.]

Figure 1. What Has Been Successful in Our Environment?


CONCLUSIONS 

The experiences of the SEL demonstrate that statistically valid eval- 
uation is possible in the software development environment, but only 
if the prerequisite quantitative characterization of the process has 
been obtained. Through its program of assessing and applying new 
methods and tools, the SEL is actively pursuing the creation of a 
more productive software development environment. 
Overview


During 1982, in conjunction with the NASA/GSFC Software Engineering
Laboratory (SEL), research was conducted in four areas: Software Develop-
ment Predictors, Error Analysis, Reliability Models, and Software Metric
Analysis. Summaries of the projects follow.

1. Software Development Predictors

A study is being done on the use of dynamic characteristics as 
predictors for software development. It is hoped that by examining a 
set of readily available characteristics, the project manager may be 
able to determine such things as when a project is in trouble and evalu- 
ate the quality of the product as it is being designed. 

Project DEB was selected as the control for the project since it 
was considered fairly successful and is well documented. Information 
found in the history files and resource summary files was initially 
utilized. These files were chosen because the information they contain 
is readily accessible to the manager (i.e., number of lines of code, man-
power, computer time, etc.). Several profiles of project DEB were then
made using this information. Project DEA's profiles were then compared 
with these results. This project was chosen because it was very similar 
to DEB but was considered less successful. 

The history file was first examined to see if any growth pattern 
existed for the lines of code. The initial look at DEA and DEB looked 
hopeful but further investigation of other projects showed no discerni- 
ble pattern. Other examinations of this file yielded similar results. 
When a comparison of the information in the history and resource
summary files was made, some differences did appear. Initial plots used
cumulative totals versus different time factors. These plots did
demonstrate visible differences between the two projects. Further
investigation using weekly totals instead of cumulative totals showed
an even larger difference between the projects.

Project DEA had a higher frequency of changes at the beginning
of the project, while at the same time, the number of hours of manpower
reported for the interval was less. The number of computer runs made
was higher for DEB in the part of the project where DEA was experiencing
the higher number of changes per hour of manpower. In all, project DEA
appears to have had less effort placed during the early phase of the
project, which may have led to the problems in the end. Another important
aspect of project DEA was that several thousand lines of code appear to
have been transported. Adaptation of this code may explain the high number
of changes initially seen in DEA.
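
The comparison above rests on two simple derived series: weekly totals obtained from running totals, and changes per hour of manpower. A minimal sketch of both follows. The function names and the numbers are hypothetical and are not taken from the DEA or DEB files.

```python
def weekly_from_cumulative(cumulative):
    """Convert running totals into per-week totals."""
    previous = [0] + list(cumulative[:-1])
    return [cur - prev for prev, cur in zip(previous, cumulative)]

def changes_per_hour(weekly_changes, weekly_hours):
    """Changes per staff-hour for each week; None where no hours were reported."""
    return [c / h if h else None for c, h in zip(weekly_changes, weekly_hours)]

# Hypothetical cumulative totals for the first four weeks of a project.
cum_changes = [5, 12, 30, 36]
cum_hours = [40, 100, 160, 280]

wk_changes = weekly_from_cumulative(cum_changes)
wk_hours = weekly_from_cumulative(cum_hours)
rates = changes_per_hour(wk_changes, wk_hours)
```

In this made-up series the third week stands out with the highest change rate, which is the kind of difference a weekly view exposes and a cumulative plot tends to smooth over.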

From this examination the following general goals and
hypotheses have been generated:

A) The manpower usage in the SEL environment is a discernible pattern 
and may be used as a predictor. 

1) The ideal staffing for a successful project is a two hump curve 
with the second hump beginning roughly 2/3 into the project. 

2) The two humps mentioned in hypothesis 1 should peak at approxi- 
mately the same height. 

3) The maximum peak height of the first hump is proportional to the 
final size of the project. This also holds for the second hump based
on hypothesis two. 

4) The location of the two peaks is constant with relation to the 
amount of manpower utilized. 

5) The amount of manpower expended between the two peaks is con- 
stant. 



6) Projects deemed less successful by subjective analysis have 
sharp changes in the amount of manpower spent per change. 

B) The pattern of changes in relation to manpower, computer runs, lines 
of code, etc. may be used as a predictor in the SEL environment. 

1) The amount of manpower to make a change should increase toward
the end of a project and be stable at the beginning. 

2) The manpower per change should be lower in the beginning of the 
project. See also goal D. 

3) Projects deemed less successful by subjective analysis have 
sharp changes in the amount of manpower spent per change. 

4) The ratio of changes to computer runs should decrease as the pro-
ject evolves.

5) The amount of computer time spent on detecting and correcting a 
given change will remain constant. 

C) The number of computer runs is closely related to the development of 
a project and may be used to judge project development. 

1) The number of computer runs remains constant during the initial 
hump of the staffing curve. The number of computer runs will drop 
during the second hump of the staffing curve. 

2) The ratio of changes to computer runs should decrease as the 
project evolves. 

D) A close examination of the types of changes and the pattern they make 
over time should be a good indication of the success of a given project. 

1) Time-consuming changes that occur late in the project more often
appear in modified code. 

2) Unit testing is not as extensive on modules with modified code. 
Undetected errors may cause major problems later in development.

3) The types of changes vary across the development of a project. 

4) The number of changes per hour of manpower is related to the 
type of changes being done. 

5) The types of change that require more time to correct occur dur- 
ing the second staffing hump. 
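
Several of the hypotheses under goal A concern the shape of the staffing curve. As a minimal sketch of how such a shape might be checked, the code below smooths a weekly staff-hours series, locates its local maxima, and compares the two peak heights. The smoothing window, the peak definition, and the data are all assumptions made for illustration; the study does not describe its own procedure.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def find_peaks(values):
    """Indices of strict local maxima (interior points only)."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]

# Hypothetical weekly staff-hours for a 12-week project.
hours = [10, 30, 60, 80, 60, 40, 30, 40, 60, 78, 50, 20]

smooth = moving_average(hours)
peaks = find_peaks(smooth)
# Peak positions as a fraction of project duration.
positions = [p / (len(hours) - 1) for p in peaks]
# Ratio of second peak height to first (hypothesis A2 expects a value near 1).
height_ratio = smooth[peaks[1]] / smooth[peaks[0]] if len(peaks) >= 2 else None
```

With real data the smoothing window would need tuning, since noisy weekly reports can produce many small spurious maxima.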

Several projects will now be examined to test the validity of these
findings. The change report forms will also be examined to see if the
information in them yields any useful predictors.

To conclude, the study has completed its initial analysis of the 
two projects. It appears there are some significant factors that could 
be useful as predictors. Further analysis may yield some information 
that would be useful to a project manager.


2. Error Analysis 

A) Publication of existing results: Three papers are being prepared
from earlier work on error analysis conducted by the SEL.
One is on the data collection methodology and the validation of the
accuracy of the data, the second is on the analysis of the SEL pro-
jects directly, and the third is a comparison of the SEL projects
with projects of the Naval Research Laboratory. These papers are
currently being submitted for publication and will be published as
University of Maryland Technical Reports in the interim.

B) A study on software errors and complexity: The distribution and
relationships derived from the change data collected during the develop-
ment of the medium scale satellite project show that meaningful results
can be obtained which allow insight into software traits and the
environment in which it is developed. The project studied in this case
was GMAS. Modified and new modules were shown to behave similarly. An
abstract classification scheme for errors, which allows a better under-
standing of the overall traits of a software project, was also provided.
Finally, various size and complexity metrics were examined with respect
to errors detected within the software, yielding some interesting
results. A University of Maryland Technical Report describing these
results was published [Bas82]. This paper has been submitted for publi-
cation.

C) A further examination of the error characteristics of the DE_A and
DE_B projects is currently being undertaken. This error analysis is
being conducted using the techniques developed and documented in [Wei81]
and [Per82]. The focal point of this research effort is to characterize
errors in the NASA/GSFC software development environment.

A preliminary review of a sample of the Change Report Forms from 
both DE_A and DE_B has been conducted. The sample included only those
CRF's for which an error change was reported. The purpose of this 
review was to 'get a flavor' for the data collected and to preliminarily 
assess the consistency of that data with the results found to date by 
SEL personnel. 

The sample included 98 CRF's from DE_A and 90 CRF's from DE_B. Of
the 98 CRF's from DE_A, 63 (64.3%) of the errors were classified as an
'error in the design or implementation of a single component.' Of the
90 CRF's from DE_B, 16 errors were reported as 'clerical errors.' Of the
remaining 74 DE_B errors (non-clerical errors), 61 (84.2%) of the errors
were also classified as 'errors in the design or implementation of a
single component.'

Although the percentage classified as 'errors in a single com-
ponent' for DE_B was higher than in the other studies, these preliminary
results appear to follow the results of previous analyses [Wei81]. As in
that previous work, the distribution of errors in other categories does
not neatly fit a pattern. In fact, there are too few events in the 
other categories to draw any initial conclusions. It will be interest- 
ing to explore the reason(s) DE_B experienced a substantially larger 
number of 'clerical errors.' 

There are marked differences in the remaining DE_A and DE_B error 
reports. This may be attributable to the reported differences in the 
two projects. It is not possible at this time to conjecture on more
tangible causes for the differences. The full set of error change
reports will have to be examined for both projects.

It is worth noting here that for DE_A, 31 of 98 error reports 
(31.6%) examined were classified as being an 'error in the design or 
implementation of more than one component.' Based on previous results 
cited above, this is an unusually high percentage. Only 4 components 
(4.1%) had errors reported that were not in the design or implementation 
of component(s) categories.

As part of the preliminary work toward the above goal, the related 
literature released by SEL was reviewed. A conclusion reached was that
the definitions of several critical terms were not necessarily con-
sistent, and the technical reports often make too great an assump-
tion about the uniformity of use of software engineering terms.

'Interface' provides a good example of an ill-defined yet oft-used
term. Using the definition from [Wei81] (the same definition is used in
[Bas80b] and [Glo79]) it is arguable that interface errors can be cap- 
tured five ways from the CRF: 

-an error involving more than one component; 

-an error involving a common routine; 

-from textual comments in the CRF (e.g., a CRF for which the error
was entered as having affected one component but the text indicated 
that the error was in a subroutine call statement); 

-an error reported as having been located in one component but the 
change required to repair the error affected more than one com-
ponent; and

-a change that caused an error because either the change invali- 
dated an assumption made elsewhere in the software or an assumption 
made about the rest of the software in the design of the change was 
incorrect (contingent on ability to capture supporting text and 
ability to distinguish from erroneous assumptions made about a sin- 
gle component). 
An effort is currently underway to develop a more restrictive set
of definitions for software engineering terms, specifically those that
apply to error analysis. The basis of this effort is the set of defini-
tions published in [Bas80] and [Glo79], which will be modified, as neces-
sary, in consultation with those persons associated with SEL in the past
and present, whose work is or was related to the error analysis effort.

3. Reliability Models 

A study is being performed in the area of reliability models. This 
research includes the field of program testing because the validity of 
some reliability models depends on the answers to some unanswered ques- 
tions about testing. 

The eventual goal of this research is to understand how and when to 
use reliability models. We are investigating the use of functional 
testing because some reliability models make assumptions about the way 
program testing is accomplished [Musa]. It is not known if functional 
testing satisfies the random testing assumptions made by the reliability 
models. The validity of reliability models that use data generated by 
functional testing is uncertain until this question is answered. 

We are using structural coverage metrics to gain further insight 
into the effects of functional testing. A structural coverage metric is 
a measure of how much of a program was executed for given input data. 
Studying the coverage metric may allow us to develop other measures of 
reliability. 
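
As a minimal sketch of the simplest such metric, the code below computes the percent of executable statements that were executed at least once. Identifying statements by number, and the function name, are assumptions made for illustration.

```python
def percent_coverage(executable, executed):
    """Percent of executable statements that were executed at least once."""
    executable = set(executable)
    if not executable:
        raise ValueError("no executable statements")
    hit = executable & set(executed)   # ignore repeats and unknown statements
    return 100.0 * len(hit) / len(executable)

# Hypothetical statement numbers for a small routine with 20 executable statements.
executable_stmts = range(1, 21)
executed_stmts = [1, 2, 3, 4, 5, 8, 9, 10, 15, 16, 17, 18, 19, 20, 20]

coverage = percent_coverage(executable_stmts, executed_stmts)
```

Here 14 distinct statements out of 20 are reached, so the metric is 70 percent for this input data.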

An additional bonus of this research is that it allows us to com- 
pare functional testing and structural testing. It is not known how 
these two methods of testing are related. The results of this investi-
gation may answer that question.

Since January, background material has been studied with regard to
reliability models, and functional and structural testing [Mueller]. A 
FORTRAN preprocessor has been written to calculate the structural cover- 
age metrics of GSFC FORTRAN source code. 

The preprocessor calculates the simplest metric, the percent of 
executable code that is executed. There are several ways to measure 
coverage [Auerbach]. One method uses interpretation of the source code. 
The interpreter records which statements are executed. At the end of 
interpretation, it writes a list of executed statements. 

The second method uses "switches", small sections of code that are 
inserted into the source program text wherever the flow of control 
diverges or converges. The switch has 2 values: 0 if it was not exe-
cuted, 1 if it was executed. The values of the switches are output after
execution.

An example:

    INTEGER SWITCH ( N )

    FOR I = 1, N
        SWITCH ( I ) = 0

    ...

    READ ( J );

    IF ( even ( J ))
    THEN
        SWITCH ( 1 ) = 1;
    ELSE
        SWITCH ( 2 ) = 1;
    ENDIF

    FOR I = 1, N
        WRITE ( SWITCH ( I ));

    END

When this program is executed, one of the two branches of the IF
statement will be executed. By examining the values of the array
SWITCH, we can determine what code was executed. By analyzing the code
and counting statements, the number of statements executed can be deter-
mined. In practice, the amount of data generated will be large.
Software tools are needed to help analyze the data.

The switches can be inserted by a preprocessor (before compilation) 
or by a compiler (during compilation). The switches may be in-line code 
(as in the example) or a call to a switch subroutine that records the 
flow of control. 
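
The switch idea can be sketched in a runnable form as follows. This is a Python analog written for illustration only: the actual preprocessor operates on FORTRAN source, and the helper named hit is a hypothetical stand-in for a switch subroutine.

```python
def run_instrumented(j, n_switches=2):
    """Python analog of the SWITCH example; returns the switch array after one run."""
    switch = [0] * n_switches          # every switch starts at 0 (not executed)

    def hit(index):
        # Stand-in for a call to a switch subroutine (1-based, as in the example).
        switch[index - 1] = 1

    if j % 2 == 0:
        hit(1)                         # THEN branch was taken
    else:
        hit(2)                         # ELSE branch was taken
    return switch

even_run = run_instrumented(4)
odd_run = run_instrumented(7)

# Across several runs, a switch counts as covered if any run set it.
combined = [a | b for a, b in zip(even_run, odd_run)]
```

A single run sets exactly one of the two switches, so at least two inputs (one even, one odd) are needed before both branches show as covered.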

This latter approach was taken and a preprocessor was developed 
that runs on VAX/Unix at UMCP. The preprocessor takes a copy of the 
input source code, and modifies it. This modified copy will be returned 
to the source computer (at GSFC) where it will be compiled and executed. 
The execution produces the desired coverage data. The coverage data 
will be returned to the University for analysis. 

Many things remain to be done before we reach our goal of under- 
standing how and when to use reliability models. The immediate goal is 
to try to answer the functional testing / reliability model question. 

The project RADMAS has been chosen as an experimental system [CSC]. The 
preprocessor must be used to modify the RADMAS source code. (The RADMAS 
project and its functionally-generated acceptance tests have been made
available for the coverage experiment.) The modified RADMAS code must be 
executed at GSFC using the functionally-generated acceptance tests. 

This experiment should answer these questions about functional
testing and reliability models:

-What is the percent coverage of functional testing?

-Does functional testing meet the randomness requirements
of the MTTF models? If not, can it be made to?

-Do the structural metrics show any useful patterns in
the way that functional testing tests programs? How
does the coverage set grow? At what rate does the coverage set
grow?

-How independent are individual tests from a coverage
point of view?
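
Two of the questions above, how the coverage set grows and how independent the tests are, can be phrased as small computations over per-test coverage sets. The sketch below is one possible formulation: using Jaccard similarity as the overlap measure is an assumption made here, not a choice stated in the study, and the data are hypothetical.

```python
def coverage_growth(test_coverage):
    """Size of the cumulative coverage set after each test, in execution order."""
    seen = set()
    sizes = []
    for covered in test_coverage:
        seen |= set(covered)
        sizes.append(len(seen))
    return sizes

def overlap(a, b):
    """Jaccard similarity of two coverage sets (1.0 means identical coverage)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical switch numbers hit by three acceptance tests.
tests = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]

growth = coverage_growth(tests)
sim_01 = overlap(tests[0], tests[1])
sim_02 = overlap(tests[0], tests[2])
```

In this made-up case the third test adds nothing to the coverage set and overlaps completely with the first, which is the kind of redundancy the independence question is meant to expose.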

The results of this experiment will raise further questions about 
functional testing and reliability models. This will require more exper- 
imentation. If these questions are answered, there is more work to do 
concerning how and when to use reliability models. 

4. Software Metrics

The attraction of the ability to predict the effort in developing
or explain the quality of software has led to the proposal of several
theories and metrics [Hal77, McC76, Gaf, Che78, Cur79]. In the Software
Engineering Laboratory, the Halstead metrics, McCabe's cyclomatic com-
plexity and various standard metrics have been analyzed for their rela-
tion to effort, development errors and one another [Bas82a]. This study
examined data collected from seven SEL (FORTRAN) projects and applied
three effort reporting accuracy checks to demonstrate the need to vali-
date a database.
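
For readers unfamiliar with these metrics, the sketch below computes the commonly cited Halstead measures and McCabe's cyclomatic complexity from primitive counts. The formulas are the standard textbook definitions; how operators and operands are counted is tool-specific, and the counts used here are hypothetical rather than SEL data.

```python
import math

def halstead(n1, n2, N1, N2):
    """Halstead measures from distinct (n1, n2) and total (N1, N2)
    operator and operand counts."""
    vocabulary = n1 + n2
    length = N1 + N2
    estimated_length = n1 * math.log2(n1) + n2 * math.log2(n2)
    volume = length * math.log2(vocabulary)
    difficulty = (n1 / 2) * (N2 / n2)
    effort = difficulty * volume       # Halstead's E metric
    return {
        "vocabulary": vocabulary,
        "length": length,
        "estimated_length": estimated_length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
    }

def cyclomatic_complexity(edges, nodes, components=1):
    """McCabe's V(G) = E - N + 2P for a control-flow graph."""
    return edges - nodes + 2 * components

# Hypothetical primitive counts for one module.
m = halstead(n1=8, n2=8, N1=24, N2=24)
# A single IF/ELSE: 4 nodes and 4 edges in the control-flow graph.
v = cyclomatic_complexity(edges=4, nodes=4)
```

Comparing length with estimated_length is one way to look at how closely a metric corresponds to its estimator; in this contrived example the two happen to be equal.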
The investigation examined the correlations of the various metrics
with effort (functional specifications through acceptance testing) and 
development errors (both discrete and weighted according to amount of 
time to locate and fix) across several projects at once, within indivi- 
dual projects and for individual programmers across projects. 

In order to remove the dependency of the distribution of the corre-
lation coefficients on the actual measures of effort and errors, the
non-parametric Spearman rank-order correlation coefficients were exam-
ined [Ken79]. The metrics' correlations with actual effort seem to be
strongest when modules developed entirely by individual programmers or
taken from certain validated projects are considered. When examining
modules developed totally by individual programmers, two averages formed
from the proposed validity ratios induce a statistically significant
ordering of the magnitude of several of the metrics' correlations. The
systematic application of one of the data reliability checks (the fre-
quency of effort reporting) substantially improves either all or several
of the projects' effort correlations with the metrics. In addition to
these relationships, the Halstead metrics seem to possess reasonable
correspondence with their estimators, although some of them have size-
dependent properties. In comparing the strongest correlations, neither
Halstead's E metric, McCabe's cyclomatic complexity nor source lines of
code relates convincingly better with effort than the others.
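
The rank-order correlation used above can be illustrated with a short, dependency-free sketch. It computes Spearman's coefficient as the Pearson correlation of average ranks, which handles ties. The module data are hypothetical and the function names are chosen for illustration.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman(xs, ys):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical per-module source lines of code and effort in hours.
sloc = [120, 450, 80, 300, 950]
hours = [10, 30, 12, 25, 400]
rho = spearman(sloc, hours)
```

Because only the ordering matters, the very large effort value for the last module does not dominate the result the way it would in a product-moment correlation on the raw measures.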

The metrics examined in this study were calculated from primitive
measures derived from a source analyzing program (SAP, Revision I)
[Dec82]. An earlier version of this static analyzer implemented a less
comprehensive definition of Halstead operators and operands [O'Ne78].
Some work has been done comparing the metrics' correlations when they
have been determined from the different interpretations of the primitive
measures.

This investigation has been submitted for publication to the Tran- 
sactions on Software Engineering and will appear as a University of 
Maryland Technical Report. 
SECTION 3 - RESOURCE MODELS

The technical papers included in this section were origi- 
nally prepared as indicated below. 

• Card, D. N., "Comparison of Regression Modeling 
Techniques for Resource Estimation," Computer 
Sciences Corporation, Technical Memorandum, 
November 1982 (reprinted by permission of the 
author)

• Card, D. N., "Early Estimation of Resource Expend- 
itures and Program Size," Computer Sciences Corpo- 
ration, Technical Memorandum, June 1982 (reprinted 
by permission of the author) 
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INTRODUCTION 


The development and validation of resource utilization models has 
been an active area of software engineering research. Regression 
analysis is the principal tool employed in these studies. How- 
ever, little attention has been given to determining which of the 
various regression methods available is the most appropriate. 

The objective of the study presented in this memorandum is to
compare three alternative regression procedures by examining the
results of their application to one commonly accepted equation for
resource estimation. This memorandum summarizes the data studied,
describes the resource estimation equation, explains the regression
procedures, and compares the results obtained from the procedures.


DATA SUMMARY 


This study is based on data collected from 22 flight dynamics
software projects studied by the Software Engineering Laboratory (SEL).
The general class of flight dynamics software includes applications
to support attitude determination, attitude control, maneuver
planning, orbit adjustment, and mission analysis (Reference 1). The
specific projects selected for this analysis were developed in
FORTRAN for operation on the same computer system. The range of
system size (developed lines of source code) and development effort
(staff-months) for these 22 projects is indicated in Table 1.


THE RESOURCE ESTIMATION EQUATION 

Variations of one basic equation have been incorporated in many
resource estimation models (Reference 2). This equation relates
project size to development effort. Additive and/or multiplicative
factors based on experience, complexity, software type, etc. are added
to form more sensitive models. The SEL also has developed a model
based on this equation (Reference 3). The general form of the
estimating equation is

H = A L^B                (1)

where 

H = staff-months of effort 
L = lines of source code 
A, B are constants 
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Table 1. Summary of Measures 
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Because the projects studied by the SEL include a substantial pro- 
portion of reused code, a "developed" lines of source code measure 
was devised to account for the higher productivity due to reusing 
code (Reference 3). The equation for computing developed lines of
source code is 

L = N + E + 0.2S + 0.2U                (2)

where

L = developed lines of source code
N = newly coded lines
E = extensively modified lines
S = slightly modified lines
U = lines reused unchanged.
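
Equation 2 can be sketched in Python as follows. The function and
argument names, and the project figures in the example, are
hypothetical and serve only to show the arithmetic.

```python
def developed_lines(new, extensively_modified, slightly_modified, unchanged):
    """Equation 2: L = N + E + 0.2S + 0.2U.

    Slightly modified and unchanged (reused) lines are weighted at
    20 percent of a newly coded or extensively modified line.
    """
    return new + extensively_modified + 0.2 * slightly_modified + 0.2 * unchanged


# Hypothetical project: 30,000 new, 5,000 extensively modified,
# 10,000 slightly modified, and 15,000 unchanged lines.
delivered = 30000 + 5000 + 10000 + 15000
developed = developed_lines(30000, 5000, 10000, 15000)
print(delivered, developed)
```

For this made-up project the delivered size is 60,000 lines, while the
developed size is about 40,000 lines, because the 25,000 reused lines
count as only 5,000.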

This software product measure (L) can be related to three measures
of development effort (H). These measures, as they are defined for
the subsequent analysis, are the following:

• HP - programmer staff-months of effort

• HPM - programmer and manager staff-months of effort

• HPMO - programmer, manager, and other (total) staff-months of
effort


ALTERNATIVE REGRESSION PROCEDURES 

Three alternative regression procedures are available for deriving
values for the constants in Equation 1. These are the following:

• Non-linear regression of original data 

• Linear regression of original data 

• Linear regression of logarithmically transformed data 

A non-linear regression procedure can find a least-squares solution
for the constants in Equation 1 without requiring either a
manipulation of the equation or a transformation of the data. Several
such algorithms have been implemented. However, the calculation of
non-linear solutions is computationally intensive and thus consumes a
substantial amount of computer resources. Reference 4 describes the
derivative-free algorithm used in this study.

Equation 1 can be reduced to a linear form by fixing the value of 
the exponent (B) at 1.0. The resulting equation is the following: 

H = AL (3) 

Then ordinary linear least-squares regression can be applied to the
untransformed data. Unfortunately, this simple solution ignores the
conceptual importance of a potential exponential relationship between
software size and development effort.
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This relationship can be captured by performing a logarithmic trans- 
formation of Equation 1 and the data. The resulting equation is 

Log (H) = Log (A) + B Log (L)                (4)

Solutions for A and B in this equation can be derived by ordinary 
linear least-squares regression. Although this procedure is compu- 
tationally less intensive than the non-linear procedure, it re- 
quires a prior transformation of the data. The range of the loga- 
rithmically transformed data is shown in Table 1. 
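
The three procedures can be sketched in Python as shown below. This is
a minimal illustration under stated assumptions, not the SAS analysis
used in the study: the non-linear fit uses a golden-section search over
B (a simple derivative-free stand-in chosen for this sketch, not the
algorithm of Reference 4) and assumes the squared error has a single
minimum in the searched interval. All function names and the synthetic
data are hypothetical.

```python
import math


def fit_linear(L, H):
    """Equation 3 (B fixed at 1.0): least squares for H = A*L."""
    return sum(l * h for l, h in zip(L, H)) / sum(l * l for l in L)


def fit_log_linear(L, H):
    """Equation 4: ordinary least squares on (log L, log H); returns (A, B)."""
    x = [math.log(l) for l in L]
    y = [math.log(h) for h in H]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    B = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return math.exp(my - B * mx), B


def fit_nonlinear(L, H, lo=0.1, hi=3.0, iters=200):
    """Equation 1: least squares on the untransformed data; returns (A, B).

    For a fixed B the best A has a closed form, so only B is searched.
    """
    def best_A(B):
        return (sum(h * l ** B for l, h in zip(L, H))
                / sum(l ** (2 * B) for l in L))

    def sse(B):
        A = best_A(B)
        return sum((h - A * l ** B) ** 2 for l, h in zip(L, H))

    g = (math.sqrt(5) - 1) / 2  # golden-section ratio
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if sse(c) < sse(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    B = (a + b) / 2
    return best_A(B), B


# Synthetic data generated from H = 0.5 * L**1.2 (L in thousands of lines).
size = [2.0, 5.0, 10.0, 20.0, 50.0, 100.0]
effort = [0.5 * l ** 1.2 for l in size]
print(fit_linear(size, effort))
print(fit_log_linear(size, effort))
print(fit_nonlinear(size, effort))
```

On noise-free data such as this, the log-linear and non-linear fits
both recover the generating constants; the procedures differ only when
the data scatter, since each one minimizes a different error.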


COMPARISON OF RESULTING MODELS 


Each of the regression procedures described in the previous section
was applied to the data for each measure of effort. These analyses
were performed with the Statistical Analysis System software package
(Reference 5). Table 2 summarizes the results. The goodness-of-fit
obtained by any regression model is measured by the mean square
error (MSE) and correlation coefficient (R). Unfortunately, as shown
in Table 2, these values are not directly comparable for all the
regression models considered here.
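
The two goodness-of-fit measures, and the reason their magnitudes
cannot be compared across models, can be illustrated with a short
Python sketch. It assumes that MSE is the residual sum of squares
divided by the degrees of freedom and that R is the correlation
between observed and predicted values; the exact definitions used by
the statistical package may differ. The data and the model H = 2L are
hypothetical.

```python
import math


def mean_square_error(observed, predicted, n_params):
    """Residual sum of squares divided by the degrees of freedom (n - p)."""
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return sse / (len(observed) - n_params)


def correlation(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    soo = sum((o - mo) ** 2 for o in observed)
    spp = sum((p - mp) ** 2 for p in predicted)
    return sop / math.sqrt(soo * spp)


# Hypothetical observations and predictions from a model H = 2 * L.
size = [10.0, 20.0, 40.0]
observed = [22.0, 38.0, 85.0]
predicted = [2.0 * l for l in size]

# MSE in squared staff-months, computed on the untransformed data.
raw_mse = mean_square_error(observed, predicted, n_params=1)

# MSE of the same predictions in squared log units.
log_mse = mean_square_error([math.log(o) for o in observed],
                            [math.log(p) for p in predicted], n_params=1)
print(raw_mse, log_mse, correlation(observed, predicted))
```

The same predictions give two MSE values that differ by orders of
magnitude, because one is expressed in squared staff-months and the
other in squared logarithmic units.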

However, it is clear from Table 2 that for all measures of effort
the results provided by the linear and log-linear procedures are
very similar. The estimates of A and B for the log-linear model
(Equation 4) are close to those of the linear model (Equation 3);
slight decreases in B in the log-linear case are compensated by
increases in A. Furthermore, the correlation coefficients obtained
by the two procedures are nearly identical in all three cases.
Therefore, the linear regression procedure produces a model as good
as that of the log-linear procedure in a considerably more
straightforward manner.

The model produced by the non-linear procedure differs considerably
from those produced by the linear and log-linear procedures (see
Table 2). The values of B (Equation 1) depart significantly from
1.0; the relationship defined is clearly exponential. Furthermore,
the mean square error of the non-linear model is substantially less
than that of the linear model. Although a direct comparison between
the non-linear and log-linear models (in terms of MSE or R) is not
possible, the log-linear model is so close to the linear model that
we can safely conclude that the non-linear model is the most accurate
of the three.

Figures 1 through 3 illustrate the relationships between system size
and development effort defined by the linear and non-linear models.
(The log-linear model is not shown because it is so similar to the
linear model.) A cursory examination of these figures indicates
that the linear model fits the data at the low end of the range
better while the non-linear model fits the data at the high end of the
range better.
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Table 2. Comparison of Model Results 
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Not applicable 
Fixed value 

Magnitude not comparable 
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Figure 2. Comparison of Models for Programmer and Manager Staff Hours 
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Figure 3. Comparison of Models for Total Staff-Months 



This phenomenon suggests an explanation for the closeness of the 
log-linear model to the linear model. The effect of the logarith- 
mic transformation is to weight smaller data values relatively 
higher; Table 1 shows that large data values are affected more
dramatically by the transformation. Thus, the log-linear regres- 
sion procedure produces a nearly linear result because it is 
weighted in favor of smaller data values where observation indi- 
cates that the relationship between system size and development 
effort is most nearly linear. 


CONCLUSION 


The non-linear regression procedure emerges from this study as the
superior technique. The foregoing evaluation of the three alternative
regression procedures is summarized in Table 3. The total rating
of each procedure shown in the table would be changed if the
three elements of which it is composed (numerical accuracy,
conceptual accuracy, and computational cost) were not weighted equally.

In addition to the implication for the choice of statistical
techniques, the results of the study suggest some other factors that
should be considered in future research. The estimate of the
exponent (B) derived by each procedure is fairly constant for all
measures of effort (see Table 2). The additional effort contributed
by managers and others is accounted for by an increase in the
multiplicative factor (A) for the HPM and HPMO measures of effort.
Furthermore, the effort contributed by managers and other
nonprogrammer personnel is strongly affected by the complexity of a
project, the experience of the development team, and the development
methodologies employed. This confirms that these other effects should
be represented as multiplicative factors in a comprehensive resource
estimation model. Published models generally have taken this
approach.

The exponential relationship illustrated in Figures 1 through 3
has another implication. Although the relationship between system
size and development effort is nearly linear for small systems, the
development effort due to size alone does not increase in proportion
to size for large systems. This suggests that the influences of
factors such as methodology, experience, and complexity may be
more important for large systems.

The results of this study allow the optimistic conclusion that the
basic relationship presented in Equation 1 provides a sufficient 
framework for the construction of comprehensive resource estimation 
models when the appropriate statistical techniques are applied. 
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Table 3* Relative Ratings of Regress 
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APPENDIX -• REGRESSION ANALYSIS RESULTS 

This appendix reproduces the computer generated output from which 
Table 2 was compiled. The following detailed tables are included: 

Table   Content

A-1     Non-Linear Model for Programmer Staff-Months
A-2     Linear Model for Programmer Staff-Months
A-3     Log-Linear Model for Programmer Staff-Months
A-4     Non-Linear Model for Programmer and Manager Staff-Months
A-5     Linear Model for Programmer and Manager Staff-Months
A-6     Log-Linear Model for Programmer and Manager Staff-Months
A-7     Non-Linear Model for Total Staff-Months
A-8     Linear Model for Total Staff-Months
A-9     Log-Linear Model for Total Staff-Months
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1 . INTRODUCTION 


A substantial amount of software engineering research effort
has been focused on the development of software cost estimation
models. A consensus (of sorts) has emerged on that
topic. The following relationship is widely accepted:

H_s = a L^b                (1)

where H_s = staff-hours of effort
L = lines of code
a = a constant
b = a constant

The Software Engineering Laboratory (SEL) has devised a
measure of lines of code based on the origin of the delivered
code that is substituted in the equation above. This is

L_dev = N + E + 0.2 (S + O)                (2)

where L_dev = "developed" lines of code
N = newly implemented lines of code
E = extensively modified lines of code
S = slightly modified lines of code
O = old (unchanged) lines of code

Equation 1 using "developed" lines of code has given good
results as an estimator of development effort. (The analyses
in this document are based on a sample of 20 ground-based
attitude systems.) Table 13 shows a regression analysis
that produced a correlation of 0.99 and an estimate of
b of 1.1 when the value of a was fixed at 1.0 in Equation 1.
Despite these encouraging results, this model has two
significant limitations. These are the following:

• The substantial amount of development work done in 
activities other than code implementation may not be 
adequately considered in the lines of code measure. 
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• The lines of code, whether "delivered" or "developed", 
is not known accurately until late in the development 
cycle when accurate estimates are less useful. 

The purpose of this memorandum is to discuss these limita- 
tions and to propose some alternative estimation models that 
can be used earlier in the development process, e.g., during 
requirements analysis and preliminary design. 

2. MODELS OF WORK 

The obvious alternative to lines of code as a measure of the
work done is pages of documentation. Although only a portion
of a software development team is involved in coding,
almost everyone produces some documentation. This includes
requirements, design, and operations documents. Table 1
compares the components of developed lines of code with pages
of documentation as estimators of programmer hours. A
regression model based on the two most strongly correlated
measures is described in Table 2. This model showed the
following relationship:


H_p = 0.056 N + 4.15 D                (3)

where H_p = programmer hours
N = newly implemented lines of code
D = pages of documentation

A similar comparison is made in Table 3 for these measures
as estimators of staff-hours (including programmer, manager,
and other hours). A regression model based on the two most
strongly correlated measures is described in Table 4. This
model showed the following relationship:

H_s = 0.051 N + 7.10 D                (4)

where H_s = staff-hours
N = newly implemented lines of code
D = pages of documentation
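
Equations 3 and 4 can be applied as in the following Python sketch.
The coefficients are those given above; the function names and the
example project (40,000 new lines and 1,000 pages) are hypothetical.

```python
def programmer_hours(new_lines, doc_pages):
    """Equation 3: H_p = 0.056 N + 4.15 D."""
    return 0.056 * new_lines + 4.15 * doc_pages


def staff_hours(new_lines, doc_pages):
    """Equation 4: H_s = 0.051 N + 7.10 D."""
    return 0.051 * new_lines + 7.10 * doc_pages


# Hypothetical project: 40,000 newly implemented lines, 1,000 pages.
print(programmer_hours(40000, 1000))
print(staff_hours(40000, 1000))
```

For this made-up project the estimates are roughly 6,390 programmer
hours and 9,140 total staff-hours. The difference comes almost entirely
from the larger documentation coefficient in Equation 4.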
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The correlation coefficient (r) associated with each of the 
relationships expressed in Equations 3 and 4 was 0.97, com- 
parable to that obtained by substituting Equation 2 for L in 
Equation 1. These results suggest that the best measures of 
work done are lines of new code and pages of documentation. 
Reused lines of code do not seem to contribute directly to 
resource expenditures. However, the requirements analysis 
and design effort involved in reusing previously developed 
code may be included in the pages of documentation measure. 

Although pages of documentation appears to be an important
measure of work, it has the same limitation as lines-of-code
measures. Pages of documentation cannot be determined accurately
early in the development cycle. The next section discusses
some other measures that can be used to develop models
for early estimation of resource expenditures and program
size.

3. MODELS FOR EARLY ESTIMATION

Few objective measures are available early in the software 
development process. The following five measures were con- 
sidered in this analysis: 

• Number of subsystems - requirements analysis 

• Number of data sets - preliminary design 

• Complexity (PRICE-S) - preliminary design 

• Number of new modules - detailed design

• Number of reused modules (extensively modified, slightly 
modified, and old) - detailed design 

The following sections discuss the use of these measures for
early estimation of program size and resource expenditures.

3.1 PROGRAM SIZE


The correlations of the measures described here with deliv- 
ered lines of code are compared in Table 5. Three regression 
models were developed (Tables 6, 7, and 8) . The two most 
useful of these are the following: 


L_del = 7596 S                (5)

L_del = [...] N + 195 R                (6)

where L_del = delivered lines of code
S = number of subsystems
N = number of new modules
R = number of reused modules


Equation 5 (r = 0.99) defines an estimating relationship for 
program size that can be used during the requirements analy- 
sis phase. Equation 6 (r = 0.98) defines an estimating re- 
lationship of comparable reliability that can be used during 
the design phase. 

3.2 RESOURCE EXPENDITURES 

The correlations of the measures described here with staff-hours
of effort are compared in Table 9. Three regression
models were developed (Tables 10, 11, and 12). The two most
useful of these are the following:

H_s = 1634 S                (7)

H_s = 45 N + 28 R                (8)

where H_s = staff-hours
S = number of subsystems
N = number of new modules
R = number of reused modules

Equation 7 (r = 0.93) defines an estimating relationship for 
resource expenditures that can be used during the require- 
ments analysis phase. Equation 8 (r = 0.94) defines an 
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estimating relationship of higher reliability that can be 
used during the design phase. 
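
Equations 7 and 8 can be applied as in the following Python sketch,
which shows how the estimate might be refined as a project moves from
requirements analysis to design. The coefficients are those given
above; the function names and the project figures are hypothetical.

```python
def staff_hours_from_subsystems(subsystems):
    """Equation 7: H_s = 1634 S (requirements analysis phase)."""
    return 1634 * subsystems


def staff_hours_from_modules(new_modules, reused_modules):
    """Equation 8: H_s = 45 N + 28 R (design phase)."""
    return 45 * new_modules + 28 * reused_modules


# Hypothetical project: 6 subsystems known at requirements analysis,
# then 150 new and 100 reused modules identified during design.
early_estimate = staff_hours_from_subsystems(6)
design_estimate = staff_hours_from_modules(150, 100)
print(early_estimate, design_estimate)
```

For this made-up project the first estimate is 9,804 staff-hours and
the design-phase estimate is 9,550 staff-hours.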

4. CONCLUSION 

The preceding analysis has demonstrated two important points. 
These are the following: 

• New measures of productivity which incorporate other
development products besides lines of code must be
investigated. Pages of documentation is a good candidate.

• Effective estimates of program size and resource ex- 
penditures can be made using measures that are avail- 
able early in the development cycle. 
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Table 1. Components of Programmer Effort 
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Table 9, Comparison of Early Resource Estimators 
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UMMARY STATISTICS FOR MEASURES 
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Figure 1. Relationship of Modules to Size 






Figure 3. Relationship of Modules to Total Staff Effort 
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SECTION 4 - SOFTWARE MEASURES 

The technical papers included in this section were origi- 
nally published as indicated below. 

• Basili, V. R., R. W. Selby, and T. Phillips,
"Metric Analysis and Data Validation Across FORTRAN
Projects," University of Maryland, Technical Report
TR 1228, November 1982 (reprinted by permission of
the authors)

A version of this paper also appears in IEEE Transactions
on Software Engineering, November 1983,
vol. 9, no. 7.

• Doerflinger, C. W., and V. R. Basili, "Monitoring
Software Development Through Dynamic Variables,"
University of Maryland, Technical Memorandum,
August 1983 (reprinted by permission of the
authors)

A version of this paper also appears in Proceedings
of the Seventh International Computer Software and
Applications Conference. New York: Computer
Societies Press, November 1983.

• Basili, V. R., and B. T. Perricone, "Software Errors
and Complexity: An Empirical Investigation,"
University of Maryland, Technical Report TR-1195,
August 1982 (reprinted by permission of the authors)

A version of this paper will appear in Communications
of the ACM, January 1984, vol. 27, no. 1.
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ABSTRACT 


The desire to predict the effort in developing or explain the 
quality of software has led to the proposal of several metrics in 
the literature. As a step toward validating these metrics, the
Software Engineering Laboratory has analysed the Software Science
metrics, cyclomatic complexity and various standard program
measures for their relation to 1) effort (including design through
acceptance testing), 2) development errors (both discrete and 
weighted according to the amount of time to locate and fix) and 
3) one another. The data investigated are collected from a pro- 
duction FORTRAN environment and examined across several projects 
at once, within individual projects and by individual programmers 
across projects, with three effort reporting accuracy checks 
demonstrating the need to validate a database. When the data 
come from individual programmers or certain validated projects, 
the metrics' correlations with actual effort seem to be
strongest. For modules developed entirely by individual programmers,
the validity ratios induce a statistically significant ordering 
of several of the metrics' correlations. When comparing the 
strongest correlations, neither Software Science's E metric, 
cyclomatic complexity nor source lines of code appears to relate 
convincingly better with effort than the others. 
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I . Introduction 


Several metrics based on characteristics of the software
product have appeared in the literature. These metrics attempt
to predict the effort in developing or explain the quality of
that software [11], [17], [19], [23]. Studies have applied them
to data from various organizations to determine their validity 
and appropriateness [1], [13], [15]. However, the question of 
how well the various metrics really measure or predict effort or 
quality is still an issue in need of confirmation. Since 
development environments and types of software vary, individual 
studies within organizations are confounded by variations in the 
predictive powers of the metrics. Studies across different 
environments will be needed before this question can be answered 
with any degree of confidence. 

Among the most popular metrics have been the Software Sci- 
ence metrics of Halstead [19] and the cyclomatic complexity 
metric of McCabe [23]. The Software Science E metric attempts to 
quantify the complexity of understanding an algorithm. 
Cyclomatic complexity has been applied to establish quality 
thresholds for programs. Whether these metrics relate to the con- 
cepts of effort and quality depends on how these factors are 
defined and measured. The definition of effort employed in this 
paper is the amount of time required to produce the software pro- 
duct (the number of man-hours programmers and managers spent from 
the beginning of functional design to the end of acceptance
testing). One aspect of software quality is the number of errors
reported during the product's development, and this is the
measure associated with quality for this study.

Regarding a metric evaluation, there are several issues that
need to be addressed. How well do the various metrics predict or
explain these measures of effort and quality? Does the
correspondence increase with greater accuracy of effort and error
reporting? How do these metrics compare in predictive power to simpler
and more standard metrics, such as lines of source code or the
number of executable statements? These questions deal with the
external validation of the metrics. More fundamental questions
exist dealing with the internal validation or consistency of the
metrics. How well do the estimators defined actually relate to
the Software Science metrics? How do the Software Science
metrics, the cyclomatic complexity metric and the more
traditional metrics relate to one another? In this paper, both sets
of issues are addressed. The analysis examines whether the given
family of metrics is internally consistent and attempts to
determine how well these metrics really measure the quantities that
they theoretically describe.

One goal of the Software Engineering Laboratory [6], [7],
[8], [10], a joint venture between the University of Maryland,
NASA/Goddard Space Flight Center and Computer Sciences
Corporation, has been to provide an experimental database for examining
these relationships and providing insights into the answering of
such questions.
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The software comprising the database is ground support 
software for satellites. The systems analyzed consist of 51,000 
to 112,000 lines of FORTRAN source code and took between 6900 and 
22,300 man-hours to develop over a period of 9 to 21 months. 
There are from 200 to 600 modules (e.g., subroutines) in each 
system and the staff size ranges from 8 to 23 people, including 
the support personnel. While anywhere from 10 to 61 percent of 
the source code is modified from previous projects, this analysis 
focuses on just the newly developed modules.

The next section discusses the data collection process and 
some of the potential problems involved. The third section
defines the metrics and interprets the counting procedure used in 
their calculation. In the fourth section, the Software Science 
metrics are correlated with their estimators and related to more 
primitive program measures. Finally, the fifth section deter- 
mines how well this collection of volume and complexity metrics 
corresponds to actual effort and developmental errors. 

II . The Data 

The Software Engineering Laboratory collects data that deal 
with many aspects of the development process and product. Among 
these data are the effort to design, code and test the various 
modules of the systems as well as the errors committed during 
their development. The collected data are analyzed to provide 
insights into software development and to study the effect of
various factors on the process and product. Unlike the typical
controlled experiments where the projects tend to be smaller and
the data collection process dominates the development process,
the major concern here is the software development process, and
the data collectors must cause minimal interference to the
developers.

This creates potential problems with the validity of the
data. For example, suppose we are interested in the effort
expended on a particular module and one programmer forgets to
turn in his weekly effort report. This can cause erroneous data
for all modules the programmer may have worked on that week.
Another problem is how a programmer should report time on the
integration testing of three modules. Does he charge the time to
the parent module of all three, even though that module may be
just a small driver? That is clearly easier to do than to
apportion the effort among all three modules he has worked on.
Another issue is how to count errors. An error that is limited to
one module is easy to assign. What about an error that required
the analysis of ten modules to determine that it affects changes
in three modules? Does the programmer associate one error with
all ten modules, an error with just the three modules or one
third of an error with each of the three?* The larger the system,
the more complicated the association. All this assumes that all
the errors are reported. It is common for programmers not to
report clerical errors because the time to fill out the error
report form might take longer than the time to fix the error.
These subtleties exist in most observation processes and must be
addressed in a fashion that is consistent and appropriate for the
environment.

* Efforts [18], [21] have attempted to make this assignment
scheme more precise by the explanation: a "fault" is a specific
manifestation in the source code of a programmer "error"; due to
a misconception or document discrepancy, a programmer commits an
"error" that can result in several "faults" in the program. With
this interpretation, what are referred to as errors in this study
should probably be called faults. In the interest of consistency
with previous work and clarity, however, the term error will be
used throughout the paper.


The data discussed in this paper are extracted from several
sources. Effort data were obtained from a Component Status
Report that is filled out weekly by each programmer on the
project. They report the time they spend on each module in the
system partitioned into the phases of design, code and test, as well
as any other time they spend on work related to the project,
e.g., documentation, meetings, etc. A module is defined as any
named object in the system; that is, a module is either a main
procedure, block data, subroutine or function. The Resource
Summary Form, filled out weekly by the project management,
represents accounting data and records all time charged to the
project for the various personnel, but does not break effort down
on a module basis. Both of these effort reports are utilized in
Section V of this paper to validate the effort reporting on the
modules. The errors are collected from the Change Report Forms
that are completed by a programmer each time a change is made to
the system. While the collection of effort and error data is a
subjective process and done manually, the remainder of the
software measures are objective and their calculation is
automated.


A static code analyzing program called SAP [25] automatically
computes several of the metrics examined in this analysis.
On a module basis, the SAP program determines the number of
source and executable statements, the cyclomatic complexity, the
primitive Software Science metrics and various other volume and
complexity related measures. Computer Sciences Corporation
developed SAP specifically for the Software Engineering
Laboratory and the program has been recently updated [14] to
incorporate a more consistent and thorough counting scheme of the
Software Science parameters. In an earlier study, Basili and
Phillips [3] employed the preliminary version of SAP in a related
analysis. The next section explains the revised counting
procedure and defines the various metrics.

III . Metric Definition 

In the application of each of the metrics, there exist
various ways to count each of the entities. This section interprets
the counting procedure used by the updated version of SAP and 
defines each of the metrics examined in the analysis. These 
definitions are given relative to the FORTRAN language, since 
that is the language used in all the projects studied here. The 
counting scheme depends on the syntactic analysis performed by 
SAP and is, therefore, not necessarily chosen to coincide exactly 
with other definitions of the various counts. 
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Primitive Software Science metrics Software Science 
defines the vocabulary metric n as the sum of the number of 
unique operators n1 and the number of unique operands n2. The 
operators fall into three classes.

i) Basic operators include

+ - * / ** = ( ) 4 // .NE. .EQ. .LE. .LT.
.GE. .GT. .AND. .OR. .XOR. .NOT. .EQV. .NEQV.

ii) Keyword operators include

IF() THEN                   /* logical if */
IF() THEN ELSE              /* logical if-then-else */
IF() , ,                    /* arithmetic if */
IF() THEN ENDIF             /* block if */
IF() THEN ELSE ENDIF        /* block if-then-else */
IF() THEN
  ELSEIF() THEN
  ... ENDIF                 /* case if */
DO                          /* do loop */
DOWHILE                     /* while loop */
GOTO <target>               /* unconditional goto: distinct
                               targets imply different operators */
GOTO (T1...Tn) <expr>       /* computed goto: different number of
                               targets imply different operators */
GOTO <ident>, (T1...Tn)     /* assigned goto: distinct identifiers
                               imply different operators */
<subr>( , ,*<target>)       /* alternate return */
END=                        /* read/write option */
ERR=                        /* read/write option */
ASSIGNTO                    /* target assignment */
EOS                         /* implicit statement delimiter */

iii) Special operators consist of the names of subroutines,
functions and entry points.


Operands consist of all the variable names and constants. Note
that the major differences of this counting scheme from that used
by Basili and Phillips [3] are in the way goto and if statements
are counted.


The metric n* represents the potential vocabulary, and
Software Science defines it as the sum of the minimum number of
operators n1* and the minimum number of operands n2*. The
potential operator count n1* is equal to two; that is, n1* equals
one grouping operator plus one subroutine/function designator. In
this paper, the potential operand count n2* is equal to the sum
of the number of variables referenced from common blocks, the
number of formal parameters in the subroutine and the number of
additional arguments in entry points.
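
The vocabulary counts defined above can be illustrated with a minimal sketch. It assumes the source has already been split into operator and operand tokens according to the three operator classes; the function name, the token lists and the value of n2* are hypothetical and are not part of SAP.

```python
def vocabulary_counts(operators, operands, n2_star):
    """Return n1, n2, n and n* from pre-classified token lists (illustrative only)."""
    n1 = len(set(operators))   # unique operators
    n2 = len(set(operands))    # unique operands
    return {
        "n1": n1,
        "n2": n2,
        "n": n1 + n2,              # vocabulary
        "n_star": 2 + n2_star,     # potential vocabulary, since n1* = 2
    }

# Tokens for the single statement  X = X + 1  followed by the implicit EOS.
example_operators = ["=", "+", "EOS"]
example_operands = ["X", "X", "1"]
counts = vocabulary_counts(example_operators, example_operands, n2_star=1)
print(counts)
```

Here the operand X is counted once toward n2 even though it occurs twice, which is the distinction between unique and total counts that the later length metrics rely on.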

Source lines This is the total number of source lines that 
appear in the module, including comments and any data statements 
while excluding blank lines. 

Source lines - comments This is the difference between the 
number of source lines and the number of comment lines. 

Executable statements This is the number of FORTRAN exe- 
cutable statements that appear in the program. 

Cyclomatic complexity  Cyclomatic complexity is defined as
being the number of partitions of the space in a module's
control-flow graph. For programs with unique entry and exit
nodes, this metric is equivalent to one plus the number of
decisions and, in this work, is equal to one plus the sum of the
following constructs: logical if's, if-then-else's, block-if's,
block if-then-else's, do loops, while loops, AND's, OR's, XOR's,
EQV's, NEQV's, twice the number of arithmetic if's, n - 1
decision counts for a computed goto with n statement labels and n
decision counts for a case if with n predicates.

A variation on this definition excludes the counts of AND's,
OR's, XOR's, EQV's and NEQV's (later referred to as
Cyclo_cmplx_2).
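
The decision counting rule above can be sketched as follows. The dictionary keys and the example counts are assumptions made for illustration; the sketch takes the construct counts as given and does not parse FORTRAN.

```python
def cyclomatic_complexity(counts, include_logical_ops=True):
    """One plus the number of decisions, following the counting rule in the text."""
    simple = ["logical_if", "if_then_else", "block_if",
              "block_if_then_else", "do_loop", "while_loop"]
    logical = ["and", "or", "xor", "eqv", "neqv"]

    decisions = sum(counts.get(key, 0) for key in simple)
    if include_logical_ops:                      # dropped for Cyclo_cmplx_2
        decisions += sum(counts.get(key, 0) for key in logical)
    decisions += 2 * counts.get("arithmetic_if", 0)
    # A computed goto with n statement labels contributes n - 1 decisions.
    decisions += sum(n - 1 for n in counts.get("computed_goto_labels", []))
    # A case if with n predicates contributes n decisions.
    decisions += sum(counts.get("case_if_predicates", []))
    return 1 + decisions

# Hypothetical module: counts are made up for the example.
example_counts = {
    "logical_if": 2,
    "do_loop": 1,
    "and": 1,
    "arithmetic_if": 1,
    "computed_goto_labels": [3],   # one computed goto with 3 labels
    "case_if_predicates": [2],     # one case if with 2 predicates
}
print(cyclomatic_complexity(example_counts))                             # 11
print(cyclomatic_complexity(example_counts, include_logical_ops=False))  # 10
```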

Calls  This is the number of subroutine and function
invocations in the module.

Calls and jumps  This is the total number of calls and
decisions as they are defined above.

Revisions  This is the number of versions of the module
that are generated in the program library.

Changes  This is the total number of changes to the system
that affected this module. Changes are classified into the
following types (a single change can be of more than one type):

a. error correction
b. planned enhancement
c. implement requirements change
d. improve clarity
e. improve user service
f. debug statement insertion/deletion
g. optimization
h. adapt to environment change
i. other


Weighted changes  This is a measure of the total amount of
effort spent making changes to the module. A programmer reports
the amount of effort to actually implement a given change by
indicating either

a. less than one hour,
b. one hour to a day,
c. one day to three days or
d. over three days.

The respective means of these durations, 0.5, 4.5, 16 and 32
hours, are divided equally among all modules affected by the
change. The sum of these effort portions over all changes
involving a given module defines the weighted changes for the
module.
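
The apportioning of change effort among modules can be sketched as below. The category letters and hour values come from the definition above; the function name, the data layout and the module names are assumptions for illustration.

```python
# Representative hours for the reported effort categories a through d.
EFFORT_HOURS = {"a": 0.5, "b": 4.5, "c": 16.0, "d": 32.0}

def weighted_changes(changes):
    """changes: iterable of (effort_category, list_of_affected_modules)."""
    totals = {}
    for category, modules in changes:
        share = EFFORT_HOURS[category] / len(modules)   # divided equally
        for module in modules:
            totals[module] = totals.get(module, 0.0) + share
    return totals

# Hypothetical change reports.
example_changes = [
    ("b", ["MODA", "MODB", "MODC"]),   # 4.5 hours split three ways
    ("a", ["MODA"]),                   # 0.5 hours, one module
    ("c", ["MODA", "MODB"]),           # 16 hours split two ways
]
weights = weighted_changes(example_changes)
print(weights)   # {'MODA': 10.0, 'MODB': 9.5, 'MODC': 1.5}
```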

Errors  This is the total number of errors reported by
programmers; i.e., the number of system changes that listed this
module as involved in an error correction. (See the footnote at
the bottom of page 4 regarding the usage of the term "error".)

Weighted errors  This is a measure of the total amount of
effort spent isolating and fixing errors in a module. For error
corrections, a programmer also reports the amount of effort spent
isolating the error by indicating either

a. less than one hour,
b. one hour to one day,
c. more than one day or
d. never found.

The representative amounts of time for these durations, 0.5, 4.5,
16 and 32 hours, are combined with the effort to implement the
correction (as calculated earlier) and divided equally among the
modules changed. The sum of these effort portions over all error
corrections involving a given module defines the weighted errors
for the module.

IV. Internal Validation of the Software Science Metrics

The purpose of this section is to briefly define the
Software Science metrics, to see how these metrics relate to
standard program measures and to determine if the metrics are
internally consistent. That is, Software Science hypothesizes
that certain estimators of the basic parameters, such as program
length N and program level L, can be approximated by formulas
written totally in terms of the number of unique operators and
operands. Initially, an attempt is made to find correlations
between various definitions of these quantities based on the
interpretations of operators and operands given in the previous
section. Then, the family of metrics that Software Science
proposes is correlated with traditional measures of software.

Program length  Program length N is defined as the sum of
the total number of operators N1 and the total number of operands
N2; i.e., N = N1 + N2. Software Science hypothesizes that this
can be approximated by an estimator N^ that is a function of the
vocabulary, defined as

     N^ = n1 log2(n1) + n2 log2(n2).

The scatter plot appearing in Figure 1 and Pearson correlation
coefficient of .899 (p < .001; 1794 modules) show the
relationship between N and N^ (polynomial regression rejects
including a second degree term at p = .05). (The symbol p will be
used to stand for significance level.) Several sources [12],
[16], [26], [27] have observed that the length estimator tends to
be high for small programs and low for large programs. The
correlations and significance levels for the pairwise Wilcoxon
statistic [20], broken down by executable statements and length,
are displayed in Table 1. In our environment, either measure of
size demonstrates that N^ significantly overestimates N in the
first and second quartiles and underestimates it (most
significantly) in the fourth quartile. Feuer and Fowlkes [15]
assert that the accuracy of the relation between the natural
logarithms of estimated and observed length changes less with
program size. The scatter plot appearing in Figure 2 and
correlation coefficient for ln N vs. ln N^ of .927 (p < .001;
1794 modules) show moderate improvement.
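
As a small illustration of the length estimator, the sketch below evaluates n1 log2(n1) + n2 log2(n2) for made-up counts; the function name and the sample values are assumptions, and both counts must be positive for the logarithms to be defined.

```python
import math

def estimated_length(n1, n2):
    """Software Science length estimator from unique operator/operand counts."""
    return n1 * math.log2(n1) + n2 * math.log2(n2)

# Hypothetical module with 16 unique operators and 32 unique operands.
n_hat = estimated_length(16, 32)
print(n_hat)   # 16*4 + 32*5 = 224.0
```

Comparing such a value against the observed N = N1 + N2 module by module is the comparison summarized in Table 1.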

<< Figure 1 >> 


Table 1. Observed vs. estimated length broken down by program size.

a. N vs. N^ broken down by executable statements.

XQT STMTS    MODS    R*     ESTIMATION   WILCOXON SIGNIF

 0 - 19      446    .601      over          <<.0001
20 - 40      442    .511      over          <<.0001
41 - 78      457    .478      under           .0367
79 <=        449    .751      under         <<.0001

b. N vs. N^ broken down by length N.

LENGTH N     MODS    R*     ESTIMATION   WILCOXON SIGNIF

  0 - 114    449    .750      over          <<.0001
115 - 243    445    .447      over          <<.0001
244 - 512    453    .348      under           .0010
513 <=       447    .731      under         <<.0001

* (p < .001)


<< Figure 2 >> 



Program volume  A program volume metric V, defined as

     V = N log2 n,

represents the size of an implementation, which can be thought of
as the number of bits necessary to express it. The potential
volume V* of an algorithm reflects the minimum representation of
that algorithm in a language where the required operation is
already defined or implemented. The parameter V* is a function of
the number of input and output arguments of the algorithm and is
meant to be a measure of its specification. The metric V* is
defined as

     V* = (2 + n2*) log2 (2 + n2*).

The correlation coefficient for V vs. V* of .670 (p < .001; 1794
modules) shows a reasonable relationship between a program's
necessary volume and its specification.
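
The two volume definitions can be sketched directly; function names and sample values below are illustrative assumptions rather than data from the study.

```python
import math

def volume(N, n):
    """V = N * log2(n): length times the bits needed per vocabulary symbol."""
    return N * math.log2(n)

def potential_volume(n2_star):
    """V* = (2 + n2*) * log2(2 + n2*), using the potential operator count of 2."""
    return (2 + n2_star) * math.log2(2 + n2_star)

# Hypothetical module: length 100, vocabulary 64, six potential operands.
v = volume(100, 64)            # 100 * 6 = 600.0
v_star = potential_volume(6)   # 8 * 3 = 24.0
print(v, v_star)
```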

Program level  The program level L for an algorithm is
defined as the ratio of its potential volume to the size of its
implementation, expressed as

     L = V*/V.

Thus, the highest level for an algorithm is its program
specification and there L has value unity. The larger the size of
the required implementation V, the lower the program level of the
implementation. Since L requires the calculation of V*, which is
not always readily obtainable, Software Science hypothesizes that
L can be approximated by

     L^ = (2 n2) / (n1 N2).

The correlation for L vs. L^ of .531 (p < .001; 1794 modules) is
disappointingly below that of .90 given in [19]. Hoping for an
increase in the correlations, the modules are partitioned by the
number of executable statements in Table 2. Although the upper
quartiles show measured improvement over the correlation of the
whole sample, a more interesting relationship surfaces. The level
estimator significantly underestimates the program level in the
second, third and fourth quartiles, with the hypothesis being
rejected in the first quartile. The increase in magnitude of the
n2* parameter does not appear to be totally captured by the
definition of L^.

Table 2. Relationship of observed vs. estimated program level
         broken down by program size.

XQT STMTS    MODS    R*     ESTIMATION   WILCOXON SIGNIF

 0 - 19      446    .484
20 - 40      442    .672      under         <<.0001
41 - 78      457    .597      under         <<.0001
79 <=        449    .615      under         <<.0001
all         1794    .531      under         <<.0001

* (p < .001)
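
The program level and its estimator can be compared for a single module as in the sketch below. The function names and the input values are assumptions chosen only to show a case in which the estimator falls below the observed level.

```python
def program_level(v_star, v):
    """L = V*/V: potential volume over implemented volume."""
    return v_star / v

def estimated_level(n1, n2, N2):
    """Level estimator (2*n2) / (n1*N2), which needs no potential volume."""
    return (2 * n2) / (n1 * N2)

# Hypothetical module.
level = program_level(24.0, 600.0)         # 0.04
level_hat = estimated_level(16, 32, 160)   # 64 / 2560 = 0.025
print(level, level_hat, level_hat < level)
```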

Program difficulty  The program difficulty D is defined as
the difficulty of coding an algorithm. The metric D and the
program level L have an inverse relationship; D is expressed as

     D = 1/L.

An alternate interpretation of difficulty defines it as the
inverse of L^, given by

     D2 = 1/L^ = (n1 N2) / (2 n2).

Christensen, Fitsos and Smith [12] demonstrate that the unique
operator count n1 tends to remain relatively constant with
respect to length for 490 PL/S programs. They propose that the
average operand usage N2/n2 is the main contributor to the
program difficulty D2. The scatter plot appearing in Figure 3 and
Pearson correlation coefficient of .729 (p < .001; 1794 modules)
display the relationship between N2/n2 and D2 for our FORTRAN
modules. The application of polynomial regression brings in a
second degree term (p < .001) and results in a correlation of
.738.


<< Figure 3 >> 

However, after observing in Figure 4 that n1 varies with program
size, it seems as if n1's inflation might possibly better explain
D2. The scatter plot appearing in Figure 5 and the correlation of
.865 (p < .001; 1794 modules) show the relationship of D2 vs. n1.
Step-wise polynomial regression brings in a second degree term
initially, followed by a linear term (p < .001), and results in a
correlation of .879. In our environment, the unique operator
count n1 explains a greater proportion of the variance of the
difficulty D2 than the average operand usage N2/n2.
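
Both candidate explanations of the estimated difficulty follow from the way its definition factors. The identity below restates the definition in LaTeX using the text's symbols; the numeric values in the last line are made up for illustration and are not data from the study.

```latex
\[
D_2 \;=\; \frac{1}{\hat{L}} \;=\; \frac{n_1 N_2}{2\, n_2}
    \;=\; \underbrace{\frac{n_1}{2}}_{\text{half the unique operators}}
          \times
          \underbrace{\frac{N_2}{n_2}}_{\text{average operand usage}}
\]
\[
\text{e.g. } n_1 = 16,\; n_2 = 32,\; N_2 = 80:
\qquad D_2 = \frac{16}{2} \times \frac{80}{32} = 8 \times 2.5 = 20
\]
```

If n1 were nearly constant across modules, the variation in the difficulty would come almost entirely from the second factor; when n1 itself grows with program size, the first factor can dominate.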

<< Figure 4 >>

<< Figure 5 >>


Program effort  The Software Science effort metric E
attempts to quantify the effort required to comprehend the
implementation of an algorithm. It is defined as the ratio of the
volume of an implementation to its level, expressed as

     E = V/L = (V)**2 / V*.

The E metric increases for programs implemented with large
volumes or written at low program levels; that is, it varies with
the square of the volume. An approximation to E can be obtained
without the knowledge of the potential volume by substituting L^
for L in the above equation. The metric

     E^ = V/L^ = (n1 N2 V) / (2 n2) = (n1 N2 N log2 n) / (2 n2)

defines the product of one half the number of unique operators,
the average operand usage and the volume. In an attempt to
remove the effect of possible program impurities [9], [19], N^ is
substituted for N in the above equation, yielding

     E^^ = (N^ log2 n) / L^
         = (n1 N2 (n1 log2 n1 + n2 log2 n2) log2 n) / (2 n2).

The correlation coefficients for E vs. E^, E vs. E^^, ln E vs. ln
E^ and ln E vs. ln E^^ are given in Table 3a. A fit of a least
squares regression line to the log-log plot of E vs. E^ produces
the equation

     ln E = .830 * ln E^ + 1.357.

Equivalently,

     E = exp(1.357) * (E^)**0.830.

Due to this non-linear relationship and the improved correlation
of ln E vs. ln E^, the modules are partitioned by executable
statements in Table 3b. The application of polynomial regression
confirms this non-linearity by bringing in a second degree term
(p < .001), resulting in a correlation of .698. In Table 3b,
notice that the correlations seem substantially better for
modules below median size. The significant overestimation in the
upper three quartiles is attributable to the relationship of L
and the level estimator described earlier.
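
The effort metric, its two estimators and the power-law form of the reported regression can be sketched together. Function names and input values are illustrative assumptions; only the constants .830 and 1.357 come from the text.

```python
import math

def effort(v, v_star):
    """E = V / L = V**2 / V*, using L = V*/V."""
    return v ** 2 / v_star

def effort_hat(n1, n2, N2, v):
    """First estimator: V divided by the level estimator 2*n2 / (n1*N2)."""
    return n1 * N2 * v / (2 * n2)

def effort_hat_hat(n1, n2, N2):
    """Second estimator: as above, with the estimated length replacing N."""
    n = n1 + n2
    length_hat = n1 * math.log2(n1) + n2 * math.log2(n2)
    return n1 * N2 * length_hat * math.log2(n) / (2 * n2)

def fitted_effort(e_hat):
    """Power-law form of the reported log-log fit."""
    return math.exp(1.357) * e_hat ** 0.830

# Hypothetical module values.
e = effort(600.0, 24.0)                 # 360000 / 24 = 15000.0
e_hat = effort_hat(16, 32, 80, 600.0)   # 16*80*600 / 64 = 12000.0
e_hat_hat = effort_hat_hat(16, 32, 80)
print(e, e_hat, e_hat_hat, fitted_effort(e_hat))
```

Because the fitted exponent is below one, the fitted curve grows more slowly than the first estimator itself, which is one way to read the overestimation reported for the larger modules.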


Table 3. Observed vs. estimated Software Science E metric.

a. Pearson correlation (p < .001; 1794 modules).

                        R

E vs. E^              .663
ln E vs. ln E^        .931
E vs. E^^             .603
ln E vs. ln E^^       .890

b. E vs. E^ broken down by executable statements.

XQT STMTS    MODS    R*     ESTIMATION   WILCOXON SIGNIF

 0 - 19      446    .708      under           .0050
20 - 40      442    .709      over          <<.0001
41 - 78      457    .411      over          <<.0001
79 <=        449    .550      over          <<.0001

* (p < .001)




Program bugs  Software Science defines the bugs metric B as
the total number of "delivered" bugs in a given implementation.
Not to be confused with user acceptance testing, the metric B is
the number of inherent errors in a system component at the
completion of a distinct phase in its development. Bugs B is
expressed by

     B = V / E0,

where E0 is theoretically equivalent to the mean number of
elementary discriminations between potential errors in
programming. Through a calculation that employs the definitions
of E, L and lambda (lambda = LV* is referred to as the language
level), this equation becomes

     B = (lambda)**(1/3) (E)**(2/3) / E0.

The derivation determines an E0 value of 3000, assumes
(lambda)**(1/3) is approximately 1 and obtains

     B^ = (E)**(2/3) / 3000.

The correlation for B vs. B^ is .789 (p < .001; 1794 modules).
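
The step from the volume form of the bugs metric to the effort form can be made explicit. The derivation below is a sketch in LaTeX that assumes the standard Software Science reading B = V/E_0 together with the definitions of E, L and lambda given in the text.

```latex
\begin{align*}
E &= \frac{V}{L}, \qquad
\lambda = L V^{*} = L^{2} V \quad (\text{since } V^{*} = L V) \\[4pt]
\lambda^{1/3} E^{2/3}
  &= \left(L^{2} V\right)^{1/3} \left(\frac{V}{L}\right)^{2/3}
   = L^{2/3} V^{1/3} \cdot V^{2/3} L^{-2/3}
   = V \\[4pt]
B &= \frac{V}{E_0}
   = \frac{\lambda^{1/3} E^{2/3}}{E_0}
   \;\approx\; \frac{E^{2/3}}{3000}
   \qquad (E_0 = 3000,\ \lambda^{1/3} \approx 1)
\end{align*}
```

The first two lines are exact identities; only the last step, which drops the language level factor, is an approximation.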

In summary, the relationship of some of the Software Science
metrics with their estimators seems to be program size dependent.
Several observations lead to the result that the metric N^
significantly overestimates N for modules below the median size
and underestimates for those above the median size. The level
estimator L^ seems to have a moderate correlation with L, and its
significant underestimation of L in the upper three quartiles
reflects its failure to capture the magnitude of n2* in the
larger modules. With respect to the E metric, the effort
estimator E^ correlates better over the whole sample than E^^,
and their strongest correlations are for modules below median
size. The estimator E^ shows a non-linear relationship to the
effort metric E. The correlation of ln E vs. ln E^ significantly
improves over that of E vs. E^, with the E^ metric's
overestimation of E for larger modules attributable to the role
of L^ in its definition. With the above family of metrics,
Software Science attempts to quantify size and complexity related
concepts that have traditionally been described by a more
fundamental set of measures.

Table 4 displays the correlations of the Software Science
metrics with the classical program measures of source lines of
code, cyclomatic complexity, etc. There are several observations
worth noting. Length N and volume V have remarkably similar
correlations and correspond quite well with most of the program
measures. Several of the metrics correlate well with the number
of executable statements, especially the program "size" metrics
of N1, N2, N and V (also B). The level estimator L^ and its
inverse D2 seem to be much more related to the standard size and
complexity measures than their counterparts L and D1. The
language level lambda does not seem to show a significant
relationship to the standard size and complexity measures, as
expected. The E^^ metric relates best with the number of
executable statements and the modified cyclomatic complexity,
while correlating with all the measures better than the E metric
and slightly better than E^. None of the Software Science
measures correlate especially well with the number of revisions
or the sum of procedure and function calls. The primary measures
of unique operators n1 and unique operands n2 correspond
reasonably well overall with n2 being stronger with source lines
and n1 stronger with the cyclomatic complexities. In the next
section, an analysis attempts to determine the relationship that
these parameters really have with the quantities that they
theoretically describe.

Table 4. Comparison of Software Science metrics against more
         traditional software measures.

Key:  ?          not significant at .05 level
      *          significant at .05 level
      a          significant at .01 level
      otherwise  significant at .001 level

          Source   Execut   Source-  Cyclo_   Cyclo_   Revi-    Calls &
          Lines    Stmts    Cmmts    cmplx    cmplx_2  sions    Jumps    Calls

n1         .776     .854     .778     .796     .818     .361     .802     .542
n2         .352     .867     .853     .767     .774     .430     .809     .614
N1         .824     .964     .868     .881     .889     .328     .869     .552
N2         .826     .949     .871     .858     .870     .355     .870     .597
n2*        .792     .691     .754     .635     .629     .501     .683     .541
N          .829     .961     .873     .874     .884     .343     .874     .577
N^         .864     .897     .364     .800     .811     .420     .836     .621
V +        .837     .962     .875     .873     .883     .343     .876     .584
V*         .776     .677     .734     .618     .611     .485     .664     .525
L         -.098    -.179    -.112    -.170    -.173      ?       .158    -.083
L^        -.383    -.411    -.394    -.389    -.396    -.216    -.386    -.250
D1=1/L     .067a    .244     .113     .178     .196    -.093     .134      ?
D2=1/L^    .696     .872     .745     .816     .839     .269     .791     .478
N2/n2      .365     .544     .437     .508     .517     .106     .470     .241
Lambda     .136      ?       .108      ?        ?       .134      ?       .051*
E          .439     .629     .500     .535     .556     .106     .506     .282
E^         .663     .831     .711     .771     .797     .224     .748     .452
E^^        .738     .871     .760     .799     .829     .268     .788     .501
B +        .837     .962     .875     .873     .883     .343     .876     .584
B^         .546     .749     .610     .650     .670     .149     .620     .355

+ B and V will have identical correlations since they are linear
  functions of one another.

V. External Validation of the Software Science and Related Metrics 

The purpose of this section is to determine how well the 
Software Science metrics and various complexity measures relate 
to actual effort and errors encountered during the development of 
software in a commercial environment. These objective product 
metrics are compared against more primitive volume metrics, such 
as lines of source code. The reservoir of development data 
includes the monitoring of several projects and the analysis 
examines several projects at once, individual projects and indi- 
vidual programmers across projects. To remove the dependency of 
the distribution of the correlation coefficient on the actual 
measures of effort and errors, the nonparametric Spearman rank 
order correlation coefficients are examined in this section [22]. 
(The ability of a few data points to artificially inflate or 
deflate the Pearson product-moment correlation coefficient is 
well recognized.) The analysis first examines how well these 
measures correspond to the total effort spent in the development
of software. 
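
The motivation for rank order correlation can be illustrated with a short sketch. It implements the Spearman coefficient as the Pearson coefficient of the ranks (ties share their average rank); the data are made up so that a single extreme point dominates the product-moment coefficient, and neither input may be constant.

```python
import math

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank order correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

# Four points that move in opposite directions plus one extreme point.
metric = [1, 2, 3, 4, 100]
effort_hours = [4, 3, 2, 1, 100]
print(pearson(metric, effort_hours))    # close to 1 because of the extreme point
print(spearman(metric, effort_hours))   # 0.0 once only the ordering is used
```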

A. Metrics' Relation to Actual Effort

Initially, a correlation across seven projects of the
Software Science E metric vs. actual effort, on a module by
module basis using only those that are newly developed, produces
the results in Table 5. The table also displays the correlations
of some of the more standard volume metrics with actual effort.
These disappointingly low correlations create a fear that there
may be some modules with poor effort reporting skewing the
analysis. Since there is partial redundancy built into the effort
data collection process, there exists hope of validating the
effort data.

Table 5. Spearman rank order correlations Rs with effort for
         all modules (731) from all projects.

Key:  ?          not significant at .05 level
      *          significant at .05 level
      a          significant at .01 level
      otherwise  significant at .001 level

E               .345     Source_Lines    .522     B               .448
E^              .445     Execut_Stmts    .456     B^              .345
E^^             .488     Source-Cmmts    .460     Revisions       .531
Cyclo_cmplx     .463     V               .448     Changes         .469
Cyclo_cmplx_2   .467     N               .434     Weighted_Chg    .468
Calls           .414     eta1            .485     Errors          .220
Calls_&_Jumps   .494     eta2            .461     Weighted_Err    .226
D1=1/L          .126
D2=1/L^         .417

Validation of effort data  The partial redundancy in the
development monitoring process is that both managers and
programmers submit effort data. Individual programmers record
time spent on each module, partitioned by design, code, test and
support phases, on a weekly basis with a Component Status Report
(CSR). Managers record the amount of time every programmer spends
working each week on the project they are supervising with a
Resource Summary Form (RSF). Since the latter form possesses the
enforcement associated with the distribution of financial
resources, it is considered more accurate [24]. However, the
Resource Summary Form does not break effort down by module, and
thus a combination of the two forms has to be used.

Three different possible effort reporting validity checks
are proposed. All employ the idea of selecting programmers that
tend to be good effort reporters, and then using just the modules
that only they worked on in the metric analysis. The three
proposed effort reporting validity checks are:

            number of weekly CSR's submitted by programmer
a.  Vm  =  ------------------------------------------------
            number of weeks programmer appears on RSF's

            sum of all man-hours reported by programmer on all CSR's
b.  Vt  =  ----------------------------------------------------------
            sum of all man-hours reported for programmer on all RSF's

                number of weeks programmer's CSR effort > RSF effort
c.  Vi  =  1 - ------------------------------------------------------
                total number of weeks programmer active in project

The first validity proposal attempts to capture the frequency of
the programmer's effort reporting. It checks for missing data by
ranking the programmers according to the ratio Vm of the number
of Component Status Reports submitted over the number of weeks
that the programmer appears on Resource Summary Forms. The second
validity proposal attempts to capture the total percentage of
effort reported by the programmer. This proposal ranks the
programmers according to the ratio Vt formed by the sum of all
the man-hours reported on Component Status Reports over the sum
of all hours delegated to him on Resource Summary Forms.

Note that for a given week, the amount of time reported on a
Component Status Report should always be less than or equal to
the amount of time reported on the corresponding Resource Summary
Form. This is not because the programmer fails to "cover"
himself, but a consequence of the management's encouragement for
programmers to realistically allocate their time rather than to
guess in an ad hoc manner. This observation defines a third
validity proposal to attempt to capture the frequency of a
programmer's reporting of inflated effort. This data check ranks
the programmers according to the quantity Vi equal to one minus
the ratio of the number of weeks that CSR effort reported
exceeded RSF effort over the total number of weeks that the
programmer is active in the project.
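
The three validity ratios can be computed from weekly totals as in the sketch below. It assumes, for illustration only, that each form is summarized as a mapping from week number to hours for one programmer on one project, and that the weeks on which the programmer appears on an RSF are the weeks of activity; the function name and data are hypothetical.

```python
def validity_ratios(csr_hours, rsf_hours):
    """Return (Vm, Vt, Vi) for one programmer.

    csr_hours: week -> hours reported by the programmer on CSRs
               (a week is present only if a CSR was submitted)
    rsf_hours: week -> hours recorded for the programmer on RSFs
    """
    weeks_on_rsf = len(rsf_hours)
    vm = len(csr_hours) / weeks_on_rsf                       # missing data check
    vt = sum(csr_hours.values()) / sum(rsf_hours.values())   # share of effort reported
    inflated_weeks = sum(
        1 for week, hours in rsf_hours.items()
        if csr_hours.get(week, 0) > hours
    )
    vi = 1 - inflated_weeks / weeks_on_rsf                   # inflated effort check
    return vm, vt, vi

# Hypothetical four-week record: no CSR in week 3, inflated CSR in week 2.
rsf = {1: 40, 2: 40, 3: 20, 4: 40}
csr = {1: 36, 2: 44, 4: 40}
vm, vt, vi = validity_ratios(csr, rsf)
print(vm, vt, vi)
```

Programmers can then be ranked by any one ratio, or by an average of two, before selecting the modules used in the correlations.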

Metrics' relation to validated effort data  Of the given
proposals, the systems development head of the institution where
the software is being developed suggests that the first proposal,
the missing data check, would be a good initial attempt to select
modules with accurate effort reporting [24]. The missing data
ratios Vm are defined for programmers on a project by project
basis. Table 6 displays the effort correlations of the newly
developed modules worked on by only programmers with Vm >= 90%
from all projects, those with Vm >= 80% and for all newly
developed modules. Most of the correlations of the modules
included in the Vm >= 90% level seem to show improvement over
those at the Vm >= 80% level. Although this is the desired effect
and several of the Vm >= 90% correlations increase over the
original values, a majority of the correlations with modules at
the Vm >= 80% level are actually lower than their original
coefficients. Since the effect of the ratio's screening of the
data is inconsistent and the overall magnitudes of the
correlations are low, the analysis now examines modules from
different projects separately.

Table 6. Spearman rank order correlations Rs with effort for modules
         across seven projects with various validity levels.

Key:  ?          not significant at .05 level
      *          significant at .05 level
      a          significant at .01 level
      otherwise  significant at .001 level

                     Validity ratio Vm (#mods)

                   all(731)    80%(398)    90%(2...)

E                    .345        .307        .357
E^                   .445        .422        .467
E^^                  .488        .480        .513
Cyclo_cmplx          .463        .457        .479
Cyclo_cmplx_2        .467        .454        .506
Calls                .414        .360        .402
Calls_&_Jumps        .494        .475        .479
D1=1/L               .126        .088*        ?
D2=1/L^              .417        .371        .421
Source_Lines         .522        .519        .501
Execut_Stmts         .456        .429        .475
Source-Cmmts         .460        .420        .439
V                    .448        .434        .475
N                    .434        .416        .460
eta1                 .485        .462        .493
eta2                 .461        .467        .503
B                    .448        .434        .475
B^                   .345        .307        .357
Revisions            .531        .580        .565
Changes              .469        .495        .385
Weighted_Chg         .468        .521        .462
Errors               .220        .381        .205
Weighted_Err         .226        .382        .247

The Spearman correlations of the various metrics with effort
for three of the individual projects appear in Table 7.
Although the correlation coefficients vary considerably between
and among the projects, the overall improvement in projects S1
and S3 is apparent. Almost every metric's correlation with
development effort increases with the more reliable data in
projects S1 and S7. When comparing the strongest correlations
from the seven individual projects, neither Software Science's E
metrics, cyclomatic complexity nor source lines of code relates
convincingly better with effort than the others. Note that the
estimators of the Software Science E metric, E^ and E^^, appear
to show a stronger relationship to actual effort than E.

Table 7. Spearman rank order correlations Rs with effort for
         various validity rankings of modules from individual
         projects S1, S3 and S7.

Key:  ?          not significant at .05 level
      *          significant at .05 level
      a          significant at .01 level
      otherwise  significant at .001 level
      z          unavailable data

Project                   S1                  S3 (1)          S7 (2)
Validity ratio Vm    all    80%    90%      80%    90%      all    80%
#modules              79     29     20      132     81      127     49

E                   .613   .647   .726     .469   .419     .285   .409a
E^                  .665   .713   .746     .602   .585     .389   .569
E^^                 .700   .747   .798     .638   .640     .430   .567
Cyclo_cmplx         .757   .774   .792     .583   .608     .463   .523
Cyclo_cmplx_2       .764   .785   .787     .609   .664     .491   .523
Calls               .681   .698   .818     .442   .492     .404   .485
Calls_&_Jumps       .776   .813   .822     .594   .619     .488   .569
D1=1/L              .262a    ?      ?      .156*    ?        ?      ?
D2=1/L^             .625   .681   .745     .507   .442     .377   .499
Source_Lines        .686   .672   .729     .743   .734     .486   .499
Execut_Stmts        .688   .709   .781     .609   .594     .408   .515
Source-Cmmts        .670   .710   .778     .671   .654     .416   .471
V                   .657   .692   .774     .627   .637     .377   .497
N                   .653   .680   .755     .613   .619     .360   .484
eta1                .683   .740   .848     .553   .533     .439   .431
eta2                .667   .701   .747     .643   .698     .365   .445
B                   .657   .692   .774     .627   .637     .377   .497
B^                  .613   .643   .726     .469   .419     .285   .409a
Revisions           .677   .717   .804     .655   .632     .449   .510
Changes             .687   .645   .760     .672   .639     .238a  .380a
Weighted_Chg        .685   .629   .749     .673   .649     .238a  .256*
Errors                z      z      z      .644   .611     .253a  .438
Weighted_Err          z      z      z      .615   .605     .245a  .276*

(1) All modules in project S3 were developed by programmers
    with Vm >= 80%.
(2) There exist fewer than a significant number of modules
    developed by programmers with Vm >= 90%.

The validity screening process substantially improves the 
correlations for some projects, but not all. This observation 
points toward the existence of project dependent factors and 
interactions. In an attempt to minimize these intraproject 
effects, the analysis focuses on individual programmers across 
projects. Note that Basili and Hutchens [2] also suggest that
programmer differences have a large effect on the results when 
many individuals contribute to a project. 

The use of nodules developed solely by individual program- 
mers significantly reduces the number of available data points 
because of the team nature of commercial work. Fortunately, how- 
ever, there are five programmers who totally developed at least 
fifteen modules each. The correlations for all modules developed 
by them and their values of the three proposed validity ratios 
are given in Table 8. The order of Increasing correlation coef- 
ficients for a particular metric can be related to the order of 
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Table 8, Soearnan rank order correlations Rs with effort for nodules 
~ totally deyeloped by five Individual programmers . 


Key: ? 

not 

significant at .05 

level 



significant 

at .05 level 


a 

significant 

at .01 level 


otherwise significant 

at .001 level 




Programmer (#mods) 



PK31) 

P2( 17) 

P3(21) 

P4(24) 

P5(15) 

E 

.593 

? 

? 

.561a 

7 

B" 

.718 

.526* 

.375* 

.555a 

.507* 

B^^ 

.t89 

.570a 

? 

.539a 

.511* 

Cyelo_cmplx 

.592 

.469* 

.521a 

.565a 

7 

Cyelo”cmplx_2 

.684 

.583a 

.481* 

.546a 

7 

Calls” 

.622 

.787 

7 

.669 

7 

Calls i Jumps 

.701 

. 604a 

.451* 

.579a 

7 

Dial/L ” 

.314* 

? 

7 

7 

7 

D2a1/L“ 

.713 

.460* 

7 

.497a 

.467* 

Souree_Llnes 

.863 

.682 

.605a 

.624 

7 

Bxeeut^Stmts 

.747 

.540* 

.436* 

.631 

.534* 

Souree-Cmmts 

.826 

.576a 

.530a 

.612 

.509* 

V 

.718 

.540* 

.453* 

.579a 

.451* 

N 

.676 

.526* 

.461* 

.556a 

.471* 

etal 

.81 1 

.575a 

7 

.536a 

7 

eta2 

.765 

.701 

.527a 

.597 

7 

B 

.718 

.540* 

.453* 

.579a 

.451* 

B* 

.593 

? 

7 

.561a 

7 

Revisions 

.675 

.523* 

.777 

.468* 

7 

Changes 

.412* 

.468* 

. 600a 

7 

7 

Welghted^Chg 

.428a 

.527* 

.502a 

7 

7 

Errors ~ 

.386* 

? 

.668 

7 

.596a 

Welghted_Brr 

.342* 

? 

.624 

7 

.545* 

VALIDITY RATIOS 

(%) 





Vn 

92.5 

96.0 

87.7 

83.9 

74.1 

Vt 

97.9 

91.8 

98.8 

82.1 

74.1 

VI 

78.6 

69.5 

77.6 

80.0 

87.5 

Ave. Vn,Vt 

95.2 

93.9 

93.25 

83.0 

74.1 

Ave. Vn,Vl 

85.5 

82.75 

82.65 

81.95 

80.3 
increasing values for a given validity ratio using the Spearman
rank order correlation. The significance levels of these rank
order correlations for several of the metrics appear in Table 9.

The statistically significant correspondence between the
programmers' validity ratios Vm and the correlation coefficients
justifies the use of the ratio Vm in the earlier analysis; possible
improvement is suggested if Vm were combined with either of the
other two ratios.
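
The comparison described above can be sketched in a few lines. This is a minimal illustration, not the analysis program used in the study: it takes the Source-Cmmts correlations and the validity ratios of the five programmers from Table 8 and applies the textbook Spearman formula for untied data. The function names are assumptions introduced for the example.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_no_ties(x, y):
    """Spearman Rs = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid without ties."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Values for programmers P1..P5 as listed in Table 8.
source_cmmts_rs = [0.826, 0.576, 0.530, 0.612, 0.509]
vm = [92.5, 96.0, 87.7, 83.9, 74.1]
ave_vm_vt = [95.2, 93.9, 93.25, 83.0, 74.1]

rs_vm = spearman_no_ties(vm, source_cmmts_rs)
rs_ave = spearman_no_ties(ave_vm_vt, source_cmmts_rs)
print(round(rs_vm, 2), round(rs_ave, 2))  # prints 0.5 0.7
```

With only five programmers the coefficient can take few distinct values, so each change in the ordering moves Rs by a visible step.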

Table 9. Significance levels for the Spearman rank order correlation
         between the programmers' validity ratios and the correlation
         coefficients for several of the metrics.

                                     Ratio

Metric           Vm           Vt     Vi           Ave(Vm,Vt)  Ave(Vm,Vi)  Ave(Vt,Vi)

[illegible]                                       .09         .09
Cyclo_cmplx                                                               .05
Cyclo_cmplx_2    .05                              .02         .02
Calls_&_Jumps    .05                              .02         .02
Source_Lines     .05                              .02         .02
Source-Cmmts                                      .09         .09
V (B)                                             .09         .09
eta2             [illegible]                      .02         .02
Revisions                     .001   [illegible]  .09         .09

" Negative correlation.


In summary, the strongest sets of correlations occur between
the metrics and actual effort for certain validated projects and
for modules totally developed by individual programmers. While
relationships across all projects using both all modules and only
validated modules produce only fair coefficients, the validation
process shows patterns of improvement. Applying the validity
ratio screening to individual projects seems to filter out some
of the project specific interactions while not affecting others,
with the correlations improving accordingly. Two averages of the
validity ratios (Vm with Vt and Vm with Vi) impose a ranking on
the individual programmers that statistically agrees with an
ordering of the improvement of several of the correlations. In all
sectors of the analysis, the inclusion of [illegible] in the Software
Science E metric in its estimators E^ and E^^ seems to improve the
metric correlations with actual effort. The analysis now attempts
to see how well these metrics relate to the number of errors
encountered during the development of software.

B. Metrics' Relation to Errors

This section attempts to determine the correspondence of the
Software Science and related metrics both to the number of
development errors and to the weighted sum of effort required to
isolate and fix the errors. A correlation across all projects of
the Software Science bugs metric B and some of the standard
volume and complexity metrics with errors and weighted errors,
using only newly developed modules, produces the results in Table
10. Most of the correlations are very weak, with the exception
of system changes. These disappointingly low correlations are
attributable to the discrete nature of error reporting and to the
fact that 340 of the 652 modules (52%) have zero reported errors.
Even though these correlations show little or no correspondence,
the following observations indicate potential improvement.
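
Because so many modules share the same error count, any rank correlation with errors has to deal with ties. The sketch below shows one common way to do this, assuming average ranks for tied values and a Pearson correlation on those ranks; whether the study used this exact treatment is not stated. The module data and function names are hypothetical and serve only to illustrate the effect of a large block of zero-error modules.

```python
from statistics import mean


def average_ranks(values):
    """Assign ranks 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the tie
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the average ranks, which tolerates ties."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical modules: size in source lines and reported errors (mostly zero).
source_lines = [40, 55, 70, 90, 120, 150, 200, 260]
errors = [0, 0, 0, 0, 1, 0, 2, 3]
rs_errors = spearman(source_lines, errors)

# Share of modules with zero reported errors quoted in the text.
zero_error_percent = round(100 * 340 / 652)
```

All zero-error modules receive the same rank, so differences in size among them cannot contribute to the coefficient.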
Table 10. Spearman rank order correlations Rs with errors and
          weighted-errors for all modules (652) from six projects.

Key:  ?          not significant at .05 level
      *          significant at .05 level
      a          significant at .01 level
      otherwise  significant at .001 level

                      Errors    Weighted_err

E                     .083*     .101a
E^                    .151      .171
E^^                   .163      .186
Cyclo_cmplx           .196      .205
Cyclo_cmplx_2         .189      .200
Calls                 .220      .236
Calls_&_Jumps         .235      .248
D1=1/L                ?         ?
D2=1/L^               .124      .140
Source_Lines          .255      .265
Execut_Stmts          .177      .198
Source-Cmmts          .288      .298
V                     .168      .186
N                     .162      .180
eta1                  .102a     .132
eta2                  .181      .199
B                     .168      .186
B^                    .083*     .101a
Revisions             .375      .375
Changes               .677      .636
Weighted_Chg          .627      .677
Design_Eff            .219      .185
Code_Eff              .285      .316
Test_Eff              .149      .164
Tot_Effort            .324      .332

Project S1 has no data to distinguish errors from changes.


Weiss [4], [5] conducted an extensive error analysis that
involved three of the projects and employed enforcement of error
reporting through programmer interviews and hand-checks. For two
of the more recent projects, independent validation and verifica-
tion was performed. In addition, the on-site systems development
head asserts that due to the maturity of the collection environ-
ment, the accuracy of the error reporting is more reliable for
the more recent projects [24]. These developmental differences
provide the motivation for an examination of the relationships on
an individual project basis.


Table 11 displays the attributes of the projects and the
correlations of all the metrics vs. errors and weighted errors
for three of the individual projects. The correlations in S7, a
project involved in the Weiss study, are fair but better than
those of project S5 (not shown) that was developed at about the
same time. Project S4 and S6 (also not shown) have very poor
overall correlations and unreasonably low relationships of revi-
sions with errors, which point to the effect of being early pro-
jects in the collection effort. The trend that the attributes
produce is not very apparent, although chronology and error
reporting enforcement do seem to have some effect. In another
attempt to improve the correlations, the analysis applies the


Table 11. Spearman rank order correlations Rs with errors and
          weighted-errors for modules from three individual
          projects.

Key:  ?          not significant at .05 level
      *          significant at .05 level
      a          significant at .01 level
      otherwise  significant at .001 level

      Err        errors
      W_err      weighted-errors

Project (#mods)   S3(132)                 S4(35)                  S7(127)
                  Err         W_err       Err         W_err       Err         W_err

E                 .401        .378        ?           ?           .397        .391
E^                .536        .482        ?           ?           .507        .503
E^^               .579        .522        ?           ?           .492        .505
Cyclo_cmplx       .542        .481        ?           ?           .393        .368
Cyclo_cmplx_2     .553        .489        ?           ?           .405        .400
Calls             .445        .432        .300*       .316*       .423        .419
Calls_&_Jumps     .566        .518        ?           ?           .432        .412
D1=1/L            ?           ?           ?           ?           .168*       .178*
D2=1/L^           .491        .426        ?           ?           .563        .559
Source_Lines      .648        .622        .339*       ?           .490        .487
Execut_Stmts      .538        .505        ?           ?           .478        .465
Source-Cmmts      .599        .568        ?           ?           .501        .483
V                 .541        .495        ?           ?           .461        .456
N                 .526        .480        ?           ?           .457        .449
eta1              .550        .500        ?           ?           .488        .522
eta2              .541        .500        ?           ?           .348        .367
B                 .541        .495        ?           ?           .461        .456
B^                .401        .378        ?           ?           .396        .390
Revisions         .784        .694        .686        .630        .567        .500
Changes           .939        .864        .770        .761        .727        .670
Weighted_Chg      .840        .885        .661        .757        .624        .714
Design_Eff        ?           ?           [illegible] ?           ?           ?
Code_Eff          .620        .632        .413a       .398a       .274        .264
Test_Eff          .473        .481        .312*       ?           ?           ?
Tot_Effort        .644        .615        .455a       .447a       .253a       .245a

PROJECT ATTRIBUTES

Weiss study       [X marks: column placement illegible]
IV & V            [X marks: column placement illegible]
Chronology        recent                  early                   middle



previous section's hypothesis of focusing on individual program- 
mers. Table 12 gives the correlations of the metrics with errors 
and weighted errors for modules that two of the individual pro- 
grammers totally developed. Even though it is encouraging to see 
Table 12. Spearman rank order correlations Rs with errors and
          weighted-errors for modules totally developed by two
          individual programmers.

Key:  ?          not significant at .05 level
      *          significant at .05 level
      a          significant at .01 level
      otherwise  significant at .001 level

      Err        errors
      W_err      weighted-errors

Programmer (#mods)    P2(17)              P3(21)
                      Err       W_err     Err       W_err

E                     .514*     .447*     .368*     ?
E^                    .527*     .493*     .600a     .563a
E^^                   .515*     .473*     .666      .649
Cyclo_cmplx           .575a     .558a     .463*     .428*
Cyclo_cmplx_2         .661a     .616a     .484*     .449*
Calls                 ?         .498*     .506a     .469*
Calls_&_Jumps         .545*     .560a     .598a     .557a
D1=1/L                ?         ?         ?         ?
D2=1/L^               .558a     .526*     .459*     .429*
Source_Lines          ?         ?         .662      .646
Execut_Stmts          .624a     .577a     .579a     .533a
Source-Cmmts          ?         .436*     .635      .594a
V                     .491*     .472*     .679      .655
N                     .494*     .479*     .641      .610a
eta1                  .497*     .448*     .611a     .589a
eta2                  ?         ?         .715      .717
B                     .491*     .472*     .679      .655
B^                    .514*     .447*     .368*     ?
Revisions             ?         ?         .830      .811
Changes               .716      .662a     .855      .828
Weighted_Chg          ?         .510*     .863      .861
Design_Eff            ?         ?         .460*     .392*
Code_Eff              ?         .450*     .699      .667
Test_Eff              ?         ?         .668      .644
Tot_Effort            ?         ?         .668      .624

the correspondences of the metrics B and eta2 with errors as
among the best for programmer P3, the same metrics do not relate
as well for other programmers.

In summary, partitioning an error analysis by individual 
project or programmer shows improved correlations with the vari- 
ous metrics. Strong relationships seem to depend on the indivi- 
dual programmer, while few high correlations show up on a project 
wide basis. The correlations for the projects reflect the posi- 
tive effects of reporting enforcement and collection process 
maturity. Overall, the correlations with total errors are 
slightly higher than those with weighted errors, while the number 
of revisions appears to relate the best. 

VI. Conclusions

In the Software Engineering Laboratory, the Software Science
metrics, cyclomatic complexity and various traditional program
measures have been analyzed for their relation to effort,
development errors and one another. The major results of this
investigation are the following: 1) none of the metrics examined
seem to manifest a satisfactory explanation of effort spent
developing software or the errors incurred during that process;
2) neither Software Science's E metric, cyclomatic complexity nor
source lines of code relates convincingly better with effort than
the others; 3) the strongest effort correlations are derived when
modules obtained from individual programmers or certain validated
projects are considered; 4) the majority of the effort correla-
tions increase with the more reliable data; 5) the number of
revisions appears to correlate with development errors better
than either Software Science's B metric, E metric, cyclomatic
complexity or source lines of code; and 6) although some of the
Software Science metrics have size dependent properties with
their estimators, the metric family seems to possess reasonable
internal consistency. These and the other results of this study
contribute to the validation of software metrics proposed in the
literature. The validation process must continue before metrics
can be effectively used in the characterization and evaluation of
software and in the prediction of its attributes.
I. Overview

The Software Engineering Laboratory (SEL) is a joint effort
between the National Aeronautics and Space Administration (NASA), 
the Computer Sciences Corporation (CSC), and the University of 
Maryland established to study the software development process. 
To this end, data has been collected for the last six years. The 
data was from attitude determination and control software 
developed by CSC, in FORTRAN, for NASA. Additional information 
on the SEL, the data collection effort, and some of the studies 
that have been made may be found in papers from the Software 
Engineering Laboratory Series published by the SEL [Card82], 
[Church82], [SEL82]. 

The interest in the software development process is
motivated by a desire to predict costs and quality of projects
being planned and developed. For several years, studies have
examined the relationships between variables such as effort,
size, lines of code, and documentation [Walston77], [Basili81].
These studies, for the most part, used data collected at the end
of past projects to predict the behavior of similar projects in
the future. In 1981 the SEL concluded that many of these factors
were too dependent on the environment to be useful for the models
that had been developed [Bailey81]. Any model which attempts to
trace these relationships should therefore be calibrated to the
environment being examined. The meta-model proposed by the SEL
is designed for such flexibility [Bailey81].

Another way to isolate the environment dependent factors
is by comparing two internal factors of a project, thus ignoring
all outside influences. One approach that is used to monitor
software development examines the time gap between the initial
report of a software problem and the complete resolution of the
problem [Manley82]. Comparing two variables is useful because it
also accentuates problem areas as they develop, providing rela-
tive information rather than absolute information. Relative
information is useful to the project manager because it accentu-
ates trends as the project develops. If project environments are
similar, then similar values should be expected. Because the
project environments in the SEL are similar, it was felt that 
this approach could be further extended to provide managers with 
information about how a set of variables over the course of a 
project differed from the same set of variables on other projects 
(baselines). The managers could be alerted to potential problems 
and use other variable data and project knowledge to determine 
whether the project was in trouble. 

This methodology is flexible enough to respond to changing
needs. Every time a project is completed the measures collected
during its development may be added in to calculate a new
baseline. In this way, the baselines may adapt to any changes in
the environment, as they occur.
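
A minimal sketch of this recalculation follows. It assumes each project contributes one value of a relative measure per interval and that the baseline is recomputed from scratch when a project is added. The project names, the values, and the use of the sample standard deviation are assumptions made for the example.

```python
from statistics import mean, stdev


def compute_baseline(history):
    """Return {interval: (average, standard deviation)} over all projects."""
    intervals = sorted({i for values in history.values() for i in values})
    baseline = {}
    for interval in intervals:
        points = [values[interval] for values in history.values() if interval in values]
        # A single data point has no spread; report 0.0 in that case.
        spread = stdev(points) if len(points) > 1 else 0.0
        baseline[interval] = (mean(points), spread)
    return baseline


# Hypothetical history: software changes per computer run, keyed by percent coding.
history = {
    "project_a": {20: 0.10, 40: 0.14},
    "project_b": {20: 0.12, 40: 0.18},
}
before = compute_baseline(history)

# A newly completed project is added and the baseline is recalculated.
history["project_c"] = {20: 0.17, 40: 0.16}
after = compute_baseline(history)
```

Recomputing from the stored history keeps the example short; an incremental update would give the same averages.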

Baselines might also be developed to reflect different
attributes. For instance, several projects which had good pro-
ductivity might be grouped to form a productivity baseline. Once
baselines are established, projects in progress may be compared
against them. All measures falling outside the predetermined
tolerance range are interpreted by the manager.

II. Methodology

The implementation of this methodology is dependent on two
factors. The first factor is the availability of measures that
are project independent and can also be collected throughout a
project's development. Variables like programmer hours and
number of computer runs are project dependent. By comparing
these variables against each other a set of relative measures may
be generated which is project independent. For instance, the
number of software changes may vary from project to project. The
project dependent features shared by each variable will cancel
out when the ratio of software changes per computer run is taken.
The resulting relative measure is project independent.
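
The cancellation can be illustrated with two hypothetical projects of different size. The counts below are invented for the example, and the helper name is an assumption; only the idea of dividing one collected variable by another comes from the text.

```python
def relative_measure(numerator, denominator):
    """Return numerator per denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


# Hypothetical totals: the second project is twice the size of the first.
small_project = {"software_changes": 60, "computer_runs": 400}
large_project = {"software_changes": 120, "computer_runs": 800}

small_ratio = relative_measure(
    small_project["software_changes"], small_project["computer_runs"]
)
large_ratio = relative_measure(
    large_project["software_changes"], large_project["computer_runs"]
)
print(small_ratio, large_ratio)
```

The raw counts differ by a factor of two, yet both projects show the same number of software changes per computer run.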

The second factor is the need for fixed time intervals com- 
mon to all projects. To normalize for time, project milestones 
were used. The time into a project might be twenty percent into 
coding instead of ten weeks into the project, for instance. 
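
One way to express a calendar date as a position within a phase is sketched below. The dates are hypothetical, the function name is an assumption, and the linear scaling between the start and end milestones of the phase is a simplification chosen for the example.

```python
from datetime import date, timedelta


def percent_into_phase(today, phase_start, phase_end):
    """Return how far `today` lies into a phase, as a percentage clamped to 0..100."""
    total_days = (phase_end - phase_start).days
    if total_days <= 0:
        raise ValueError("phase_end must be after phase_start")
    elapsed_days = (today - phase_start).days
    clamped = min(max(elapsed_days, 0), total_days)
    return 100.0 * clamped / total_days


# Hypothetical coding phase lasting 350 days.
coding_start = date(1982, 3, 1)
coding_end = coding_start + timedelta(days=350)

# Ten weeks after the start of coding corresponds to twenty percent into coding.
ten_weeks_in = coding_start + timedelta(weeks=10)
percent = percent_into_phase(ten_weeks_in, coding_start, coding_end)
```

A project with a longer or shorter coding phase maps onto the same 0 to 100 scale, which is what allows intervals to be compared across projects.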
When computing the baselines one other factor was con-
sidered. At any given interval during development a variable may
measure either the total number of events that have occurred from
the beginning of development (cumulative) or the number of
events that have occurred since the last measured interval
(discrete). Since these approaches may convey different informa-
tion it was felt that they both should be used.
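
The two views are related by a running sum, as the following sketch shows. The per-interval counts are hypothetical and the function names are assumptions.

```python
from itertools import accumulate


def to_cumulative(discrete):
    """Running total: events from the beginning of development to each interval."""
    return list(accumulate(discrete))


def to_discrete(cumulative):
    """Events since the previous measured interval."""
    previous = [0] + cumulative[:-1]
    return [current - before for before, current in zip(previous, cumulative)]


# Hypothetical number of software changes recorded in four successive intervals.
changes_per_interval = [5, 12, 30, 8]
cumulative_changes = to_cumulative(changes_per_interval)
print(cumulative_changes)  # prints [5, 17, 47, 55]
```

The burst of 30 changes in the third interval stands out in the discrete series, while the cumulative series smooths it into the overall trend.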

For simplicity, the baseline for each relative measure was
defined as the average and standard deviation computed for the
measure at predetermined intervals. A project's progress may now
be charted by the software manager. At each interval in a pro-
ject's development the relative measures are compared with their
respective baseline. Any measures outside a standard deviation
are flagged. These measures are then interpreted by the project
manager to determine how the project is progressing. A flagged
measure may indicate a project is developing exceptionally well
or it may indicate a problem has been encountered.
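
The flagging rule can be sketched as follows. The baseline figures and current values are hypothetical, and the function name and the `tolerance` parameter are assumptions; the rule itself (flag anything more than one standard deviation from the average) is the one described above.

```python
def flag_measures(current, baseline, tolerance=1.0):
    """Return {measure: signed deviation in standard deviations} for flagged measures."""
    flagged = {}
    for name, value in current.items():
        average, std_dev = baseline[name]
        if std_dev == 0:
            continue  # no spread in the baseline, so no deviation can be expressed
        deviation = (value - average) / std_dev
        if abs(deviation) > tolerance:
            flagged[name] = deviation
    return flagged


# Hypothetical baseline at one interval: measure -> (average, standard deviation).
baseline_at_interval = {
    "source lines per software change": (50.0, 10.0),
    "computer runs per software change": (6.0, 2.0),
    "programmer hours per computer run": (3.0, 1.0),
}
current_project = {
    "source lines per software change": 65.0,
    "computer runs per software change": 9.0,
    "programmer hours per computer run": 3.5,
}
flagged = flag_measures(current_project, baseline_at_interval)
```

The sign of each deviation tells the manager whether the measure is above or below normal, which matters when the lists of interpretations are consulted.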

The interpretation of a set of flagged measures is a three
step process. First, the manager must determine the possible
interpretations for each flagged relative measure using lists of
possible interpretations developed and verified based on past
projects.

Second, the union of the lists of possible interpretations
of each flagged measure must be taken. The list formed by this
union contains all the possible interpretations ordered using the
number of times each interpretation is repeated in the different
lists. The larger the number of overlaps a possible interpreta-
tion has, the greater the probability it is the correct interpre-
tation.

Third, the manager must analyze the combined list and deter- 
mine if a problem exists. Interpretations with an equal number 
of overlaps all have an equal probability of being the correct 
interpretation. If none of the possible interpretations for a 
given relative measure overlap then the relative measure should 
be considered separately. 

When analyzing the interpretations, three pieces of informa-
tion must be considered: the measurements, the point in develop-
ment, and the manager's knowledge of the project. A relative
measure may indicate different things depending on the stage of
development. For instance, a large amount of computer time per
computer run early in the project may indicate not enough unit
testing is being done. Personal knowledge may also give valuable
insight.

A fundamental assumption for using this methodology is that
projects of a similar type evolve similarly. If a different type of
project were compared to this database, the manager would have to
decide whether the baselines were applicable. Depending on the
type of differences, the established baselines may or may not be
of any value.
EXAMPLE 1


Forty percent into coding a software manager finds that the 
lines of source code per software change is higher than normal. 
A list previously developed is examined to determine what the 
relative measure might indicate. The possible interpretations 
for a large number of lines of source code per software change 
might be: 

- good code 

- easily developed code 

- influx of transported code 

- near build or milestone date 

- computer problems

- poor testing approach 

If this were the only flagged measure the manager would then 
investigate each of the possibilities. If the value for the 
measure is close to the norm less concern is needed than if the 
value is further away. 

If in addition to lines of source code per software change
the number of computer runs per software change was higher than
normal, the manager would also examine this measure. The possi-
ble interpretations for a large number of computer runs per
software change might be:

- good code 

- lots of testing 

- change backlog 

- poor testing approach 

The union of the possible interpretations of these two measures
indicates that the strongest possible interpretations are 1) good
code and 2) a poor testing approach. The number of possibilities
to investigate is smaller because these are the only interpretations
which overlap. The manager must now examine the testing plan and
decide whether either of these interpretations reflects what is
actually occurring in the project. If these two possible
interpretations do not reflect what is happening on the project,
the manager would then examine the other interpretations.
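
The union step of Example 1 can be sketched as a simple count. The two lists are the ones given in the example; the function name and the alphabetical ordering of interpretations with equal counts are assumptions made for the illustration.

```python
from collections import Counter


def rank_interpretations(lists):
    """Count in how many lists each interpretation appears, most frequent first."""
    counts = Counter()
    for interpretations in lists:
        counts.update(set(interpretations))  # count each list at most once
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Lists from Example 1 (both measures are above normal).
lines_per_change = [
    "good code",
    "easily developed code",
    "influx of transported code",
    "near build or milestone date",
    "computer problems",
    "poor testing approach",
]
runs_per_change = [
    "good code",
    "lots of testing",
    "change backlog",
    "poor testing approach",
]

ranked = rank_interpretations([lines_per_change, runs_per_change])
for interpretation, count in ranked:
    print(count, interpretation)
```

Only "good code" and "poor testing approach" appear in both lists, so they head the combined list; the remaining six interpretations appear once each and are examined only if the first two are ruled out.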

III. Baseline Development

To develop a baseline one must first have variables whose
measurements were taken weekly for several projects. Five vari-
ables in the SEL database were used. The lines of source code,
number of software changes, and number of computer runs were col-
lected on the growth history form. The amount of computer time
and programmer hours were collected on the resource summary form.
Measurement of these variables started near the beginning of cod-
ing. In this study, nine separate projects were examined whose
development was documented, with sufficient data, in the SEL
database. The projects ranged in size from 51-112K lines of
source code with an average of 75K. No examination was done for
the requirements or design phases.

Once the variables were chosen the average and standard
deviation were computed for each baseline. Some baselines suf-
fered from limited data points during the beginning of the coding
phase. A couple of the projects, in which problems were known to
have existed, were flagged as soon as data on these projects
appeared, but this was fifty percent of the way into coding. It
is not known how much earlier they would have been flagged if data
had existed at the early intervals.

IV. Interpretation of Relative Measures

Once a set of baselines is established new projects may be
compared to them and potential problems flagged. To interpret
these flagged relative measures a list should be developed with
each measure's possible interpretations. Each list must consider
the possible interpretations of the relative measure when it is
either above normal or below normal. What each component vari-
able actually measures should also be considered when the dif-
ferent lists are developed.

A list was developed with possible interpretations for each 
relative measure being examined in the context of the SEL 
environment. In another environment the interpretation of these 
measures might be different. These lists are subdivided into two
categories: above and below normal. The above normal category
contains possible interpretations for the relative measure when 
it is outside one standard deviation from the average in the 
positive direction. The below normal category refers to 
interpretations when the measure is outside one standard devia- 
tion from the mean in the negative direction. 

One of the reasons this methodology works is the
implicit interdependencies between different relative measures.
To show these interdependencies more explicitly a cross reference
chart has also been provided for each interpretation to indicate
other relative measures that can have the same interpretation. A
number in the cross reference section indicates the list number
of a relative measure that can have the same interpretation. The
position of the list number in the 4-quadrant cross reference
section indicates whether both interpretations are found with
above normal values, both with below normal values, or one with
above and the other with below normal values.

With these lists a set of flagged relative measures may be 
evaluated. When a relative measure is flagged, its associated 
list is examined for possible interpretations. Overlaps of this 
list with the lists of other flagged relative measures form the 
new list of what these relative measures together might indicate. 
The more overlaps a particular interpretation has, the greater
the chance it is the correct interpretation. Interpretations
with the same number of overlaps must be considered equally. The
more relative measures are flagged, the more serious the problem may
be. It is up to the manager to determine whether the deviation
is good or bad.

V. Monitoring a Software Project's Development

Once the baselines have been developed and the lists of pos-
sible interpretations have been put together, a software manager
may monitor the actual development of a project. Example 1
demonstrated how a single interval may be interpreted. The fol-
lowing discussion will trace the development of an actual pro-
ject. During the actual use of this methodology, influence would
be exerted to correct problems as soon as they are identified.
In this study, we must be content to study a project's evolu-
tion, without hindrance, and see at what points problems could
have been detected.

Project twenty* was chosen for this examination because data
existed throughout the project's development. In most respects
project twenty was an average project. The project did have a
lower than normal productivity rate. The lower rate may be par-
tially explained by the fact that the management was less experienced
when compared to other projects. The project also suffered from
some delayed staffing. Changes in staffing will be noted when
the different time intervals are discussed.

The tables on the following page show which relative meas- 
ures were flagged when project twenty was compared to the base- 
lines for each stage of development. The numerical values 
represent how many standard deviations each flagged relative 
measure was from the baseline. The baseline for each relative 
measure was calculated using all nine projects. 

Start of Coding: 

At the start of coding only one relative measure is flagged. 
The smaller than normal number of software changes per line of 
source code using the discrete approach reflects work done during 

* The numbering convention used is an extension of the one
first used by Bailey and Basili [Bailey81].


[Tables: relative measures flagged for project twenty at each stage
of development; entries give the number of standard deviations from
the baseline. Method of measurement: cumulative.]


the design phase. The lists designed in the previous section
were directed towards code production and testing and do not
apply to this time interval when using the discrete approach.
This measure may indicate good specifications or lots of PDL
being generated. The manager might want to examine this measure
later if it is constantly repeated. Since it is the only measure
flagged at this time it will be ignored.

20% Coding:

The flagged relative measures found using the discrete
approach at this point represent the work done from the start of
coding until twenty percent of the way through coding. The list
of possible interpretations for the flagged relative measures,
generated from the lists made previously for the individual rela-
tive measures, would look like:

     # overlaps    interpretation

         3         bad specifications
         3         code removed
         2         low productivity
         2         high complexity
         2         error prone code
         1         lots of testing
         1         good testing
                   changes hard to isolate
                   changes hard to make
                   unit testing being done
                   easy errors being found

The strongest interpretations are bad specifications and code 
being removed. If the actual history is examined one finds that 
during this period there were a lot of specifications being 
changed. This resulted in code which was to be modified being 


4-61 



discarded and new code being written. During the early period 
lots of PDL was being produced but very little new executable 
code. The list of possible interpretations does show that low 
productivity is also a strong possibility. 
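
The ranking step used above can be sketched in a few lines of Python. This is only an illustration: the measure names and their interpretation lists are placeholders, not the lists defined earlier in the paper, and the sketch simply counts how many flagged measures list each interpretation, which gives the same ordering as the overlap column.

```python
from collections import Counter

# Hypothetical interpretation lists for three flagged relative measures.
# The real lists are the ones defined earlier in the paper.
INTERPRETATIONS = {
    "measure_a": ["bad specifications", "code removed", "low productivity"],
    "measure_b": ["bad specifications", "code removed", "high complexity"],
    "measure_c": ["bad specifications", "code removed", "lots of testing"],
}

def rank_by_overlap(flagged):
    """Count how many flagged measures list each interpretation."""
    counts = Counter()
    for measure in flagged:
        # set() guards against an interpretation listed twice for one measure
        counts.update(set(INTERPRETATIONS[measure]))
    # Most frequently listed first; ties broken alphabetically for a stable order
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

ranking = rank_by_overlap(["measure_a", "measure_b", "measure_c"])
for interpretation, count in ranking:
    print(count, interpretation)
```

The manager would still have to investigate the interpretations at the top of the list; the count only orders the candidates.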

40% Coding:

The flagged relative measures which appear using the cumula- 
tive approach, from this time period on, are stronger indicators 
than the ones used in the first couple of intervals because the 
average is computed using more data points. The use of the 
discrete approach for the interval of twenty to forty percent is 
still dependent on three data points. The list of possible
interpretations for this time period is:

# overlaps    interpretation

     1        low productivity
     1        high complexity
     1        error prone code
     1        bad specifications
     1        code being removed
              changes hard to isolate
              changes hard to make
              lots of testing
              unit testing being done
              good testing
              easy errors

The number of possibilities is larger with this set of possible 
interpretations. Five interpretations are slightly stronger than 
the others. During the actual development, the first release of 
the project was made. The amount of code actually written was 
also lower than normal during this period. The use of the 
discrete approach gives a stronger feeling that code is not being
written. Transported code tends to be installed in large blocks
which can be isolated using the discrete approach. 

50% Coding:

The relative measures flagged during this period are the 
same as the ones flagged at the twenty percent coding interval. 
The deviation from the norm for this interval is larger. The 
larger deviation may indicate a more serious problem. The problem
may have been just as serious earlier, but without the extra data
points that are now available it could not be determined.
The possible interpretations may be taken from the list developed
earlier. Bad specifications and code removal were not factors
during this period. The next three highest priority interpretations
were high complexity, error prone code, and low productivity.
In addition to this, the manager should be concerned with
the continued appearance of the relative measure, programmer 
hours per computer run, as seen using the cumulative approach. 
This may indicate a lot of testing going on. This in conjunction 
with error prone code as a possible interpretation may indicate 
trouble. During actual development this period was spent 
developing code for the second release. The project manager felt 
that code was still not being developed quickly enough during 
this period. 

60% Coding:

Only one relative measure is shown at this interval. The
number of programmer hours per computer run using the cumulative
approach is lower than normal for the third consecutive time.
This should concern the manager because when examining the list
for this measure one finds:

error prone code 

lots of testing 

easy errors being fixed 

Since the occurrence of this measure is persistent it may indi- 
cate that the problem was corrected but not enough effort was 
expended to completely compensate for the past problems. It 
might also indicate the problem still exists. During the actual 
project it was found that while a lot of code was written, it had
not been thoroughly tested. Release two was made during this
period which could explain a heavy test load. Two additional 
staff members were added to the project during this phase to aid 
in coding and testing. 

80% Coding:

The eighty percent coding interval does not show any meas- 
ures outside the normal bounds. The addition of two staff 
members during the sixty percent coding phase, as well as the 
addition of a senior staff member during this phase, appears to 
have adjusted the project back along the lines of normal develop- 
ment. To fully compensate for the earlier problems one might 
expect some of the measures to swing in the other direction away
from the average. The fact that this overcorrection did not occur
might explain the problems encountered in the next section.


Start of System and Integration Testing: 


The flagged relative measures at this time period reflect
the build up of effort for the third and final release. The list
of possible interpretations for the collective set of flagged
measures looks like:

# overlaps    interpretation

     3        high complexity
     3        bad specifications
     3        code being removed
     2        error prone code
     2        low productivity
     2        lots of testing
     1        changes hard to isolate
     1        unit testing being done
     1        good code
     1        poor testing
              changes hard to make
              good testing
              compute bound algorithms being run
              easy errors being fixed

Since the code did have a past history of poor testing, an unusu-
ally large build up of testing should be expected. The two
interpretations that apply most to this situation are lots of
testing and error prone code.


50% System and Integration Testing:

Only one relative measure is flagged at this interval. This
measure was flagged using the cumulative approach. An examina-
tion of the measure at the previous interval shows a very high
value. A slow drop off from this high measure is to be expected
when using the cumulative approach. The possible
interpretations that would apply for this period of development
include:

high complexity 
lots of testing 
unit testing being done 
testing code being removed 

A lot of testing is certainly indicated by past history. 


Start Acceptance Testing: 


The relative measures flagged at this interval reflect the
build up in testing before the start of acceptance testing. The
list of possible interpretations looks like:

# overlaps    interpretation

     3        bad specifications
     3        code being removed
     2        high complexity
     2        low productivity
     1        error prone code
     1        lots of testing
              changes hard to isolate
              changes hard to make
              unit testing being done
              good testing

Since little code was being developed during the testing period,
a large amount of testing with errors being found is the most
reasonable interpretation of these flagged measures. The early
history of poor testing may be seen here with errors being
uncovered late.

End Acceptance Testing:

The two flagged relative measures at the end of acceptance
testing reflect the clean up effort being made on the code. An
average amount of computer time and an average number of computer
runs indicate that the acceptance testing is going well. The
project was behind schedule due to the earlier problems encoun-
tered. Clean up was done during the acceptance testing phase in
an attempt to get the project out the door as soon as possible.

As seen in this example, the problems that occur during a
project's development are reflected in the values calculated for
the relative measures. The methodology proposed can be used to
monitor projects. The number of possible interpretations
increases with each new flagged relative measure. The ordering
of the measures by the number of overlaps provides an easy method 
of sorting the possible interpretations by priority. Another 
method of sorting the possible interpretations could include a 
factor that considers both the number of overlaps and the proba- 
bility of a given interpretation being the cause at a given 
interval. The weighting of interpretations for a given interval 
could be calculated using the pattern of occurrence of the dif- 
ferent interpretations which have appeared during the same inter- 
val in past projects. 
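
One way to combine the two factors mentioned above is sketched below. The paper does not specify a formula, so multiplying the overlap count by a historical frequency is purely an assumption made for illustration, and the history values are invented.

```python
def weighted_ranking(overlaps, interval_history):
    """Rank interpretations by overlap count times historical frequency.

    overlaps: interpretation -> number of overlaps at this interval
    interval_history: interpretation -> fraction of past projects in which
        it was the actual cause at this interval (hypothetical data)
    """
    scores = {
        name: count * interval_history.get(name, 0.0)
        for name, count in overlaps.items()
    }
    # Highest score first; ties broken alphabetically for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Invented example values, not taken from any SEL project.
overlaps = {"bad specifications": 3, "code removed": 3, "low productivity": 2}
history = {"bad specifications": 0.5, "code removed": 0.1, "low productivity": 0.4}
weighted = weighted_ranking(overlaps, history)
```

With these invented values, low productivity moves ahead of code removed even though it has fewer overlaps, which is the kind of reordering the weighting is meant to allow.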

VI. An Alternate Approach

Flagged relative measures might also be interpreted using a
decision support system. The data for the various relative meas- 
ures would be stored in a knowledge base along with a set of pro- 
duction rules. To evaluate a project the values for each rela- 
tive measure would be entered into the system. The knowledge 
base would compare the relative measures to their respective 
baselines, determine which relative measures were outside the 
norm, and interpret these relative measures using the production 
rules. A list of possible interpretations ordered by probability 
would be generated as a result. 
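
A minimal sketch of such a system is given here, assuming that each production rule names an interpretation, lists one or more alternative combinations of flagged measures that trigger it, and carries a confidence rating. The rule contents, measure names, and ratings are hypothetical; a real knowledge base would be built from the lists given earlier in the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProductionRule:
    interpretation: str
    # Each alternative is a set of measures that must all be flagged;
    # the rule fires if any one alternative is satisfied.
    alternatives: tuple
    confidence: float  # hypothetical rating between 0 and 1

    def fires(self, flagged):
        return any(alt <= flagged for alt in self.alternatives)

# Hypothetical rules; measure names are placeholders.
RULES = [
    ProductionRule(
        "error prone code",
        (frozenset({"measure_a"}), frozenset({"measure_b", "measure_c"})),
        0.7,
    ),
    ProductionRule("lots of testing", (frozenset({"measure_c"}),), 0.4),
    ProductionRule("bad specifications", (frozenset({"measure_d"}),), 0.9),
]

def interpret(flagged):
    """Return the interpretations whose rules fire, most confident first."""
    flagged = frozenset(flagged)
    fired = [(r.interpretation, r.confidence) for r in RULES if r.fires(flagged)]
    return sorted(fired, key=lambda item: -item[1])

result = interpret({"measure_b", "measure_c"})
```

Here "error prone code" fires through its second alternative, which requires two measures to be flagged together, while "bad specifications" does not fire at all.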

The difference between a decision support system and the 
approach presented in this paper is the method of interpreting 
the flagged relative measures. Each production rule in the deci-
sion support system is the logical disjunction of several flagged
measures which yields a given interpretation. Each production 
rule is assigned a confidence rating which is then used to rate 
the possible interpretations. The lists for the relative meas- 
ures provided earlier in the paper may be easily converted to 
production rules using the cross reference section. To develop 
the production rules for an interpretation one must generate the 
various combinations of relative measures which might reasonably 
imply the interpretation. Some relative measures may not imply a 
particular interpretation unless they are found in conjunction 
with another relative measure. Once the production rules are
known and a knowledge base constructed, a decision support system
may be built. For an example of a domain independent decision
support system see Reggia and Perricone [Reggia82].

VII. Summary

The methodology presented in this paper showed that invari- 
ant relationships exist for similar projects. New projects may 
be compared to the baselines of these invariant relationships to 
determine when projects are getting off track. 
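
The comparison against a baseline might look like the following sketch. It assumes the baseline for a relative measure is the mean of its values on past projects and that "outside the norm" means more than one standard deviation from that mean; the band width and the sample values are assumptions made for illustration.

```python
from statistics import mean, stdev

def flag_measure(baseline_values, current_value, width=1.0):
    """Return 'high', 'low', or None relative to the baseline band.

    The band is mean +/- width * standard deviation of the baseline
    values; width=1.0 is an assumed default.
    """
    center = mean(baseline_values)
    spread = stdev(baseline_values)
    if current_value > center + width * spread:
        return "high"
    if current_value < center - width * spread:
        return "low"
    return None

# Hypothetical values of one relative measure on five past projects.
baseline = [10.0, 12.0, 11.0, 13.0, 9.0]
status = flag_measure(baseline, 14.0)
```

Only measures for which the result is not None would be passed on to the interpretation step.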

The ability of the manager to interpret the measures that 
fall outside the norm is dependent on the amount of information 
the underlying variables convey. The manager must decide what 
attributes are to be measured (e.g. productivity) and pick vari- 
ables that are closely related to them and are also measurable 
throughout the project. As an example, a variable like lines of 
code may be too general when measuring productivity. Measuring 
the newly developed code, either source code or executable code, 
would be more informative since these variables are more directly 
related to effort. How applicable an interpretation is for the 
period currently being examined should also be considered when 
ordering the list. The variables the manager finally decides on 
are then combined to form relative measures. 

One method of interpreting a relative measure is by associ-
ating lists of possible interpretations with it. When a relative
measure appears outside the norm, the list of possible interpre-
tations is considered. If more than one relative measure is out-
side the norm, the lists are combined. The more times a possible
interpretation is repeated in the lists, the greater the proba-
bility it is the cause. How applicable an interpretation is for
the period being examined should also be considered when ordering 
the list. The manager must investigate the suggested causes to 
determine the real one. 

VIII. Conclusion

The ability to monitor a project's development and detect
problems as they develop may be feasible. The methodology pro- 
posed showed favorable results when examining a past case. 

The use of baselines and lists of interpretations for com- 
paring projects provides an easy method for monitoring software 
development. Both the baselines and the lists of interpretations 
may be updated as new projects are developed. As more knowledge
is gleaned, the accuracy of this system should improve and provide
a valuable tool for the manager.


ABSTRACT


The distributions and relationships derived from the change
data collected during the development of a medium scale
satellite software project show that meaningful results can
be obtained which allow an insight into software traits and
the environment in which it is developed. Modified and new
modules were shown to behave similarly. An abstract classifi-
cation scheme for errors which allows a better understanding
of the overall traits of a software project is also
shown. Finally, various size and complexity metrics are
examined with respect to errors detected within the software,
yielding some interesting results.


1.0 INTRODUCTION


The discovery and validation of fundamental relation- 
ships between the development of computer software, the 
environment in which the software is developed, and the fre- 
quency and distribution of errors associated with the 
software are topics of primary concern to investigators in 
the field of software engineering. Knowledge of such rela- 
tionships can be used to provide an insight into the charac- 
teristics of computer software and the effects that a pro- 
gramming environment can have on the software product. In 
addition, it can provide a means to improve the understand- 
ing of the terms reliability and quality with respect to 
computer software. In an effort to acquire a knowledge of
these basic relationships, change data for a medium scale
software project was analyzed (change data is any
documentation which reports an alteration made to the
software for a particular reason).

In general, the overall objectives of this paper are
threefold: first, to report the results of the analyses;
second, to review the results in the context of those
reported by other researchers; and third, to draw some con-
clusions based on the aforementioned. The analyses
presented in this paper encompass various types of distribu-
tions based on the collected change data. The most impor-
tant of these are the error distributions observed within
the software project.

In order for the reader to view the results reported in 
this paper properly, it is important that the terms used 
throughout this paper and the environment in which the data 
was collected are clearly defined. This is pertinent since 
many of the terms used within this paper have appeared in 
the general literature often to denote different concepts. 
Understanding the environment will allow the partitioning of 
the results into two classes: those which are dependent on 
and those which are independent of a particular programming 
environment. 


1.1 DESCRIPTION OF THE ENVIRONMENT 


The software analyzed within this paper is one of a 
large set of projects being analyzed in the Software 
Engineering Laboratory (SEL). The particular project 
analyzed in this paper is a general purpose program for 
satellite planning studies. These studies include among 
others: mission maneuver planning; mission lifetime; mission 
launch; and mission control. The overall size of the 
software project was approximately 90,000 source lines of 
code. The majority of the software project was coded in FOR-
TRAN. The system was developed and executes on an IBM 360.

The developers of the analyzed software had extensive
experience with ground support software for satellites. The 
analyzed system represents a new application for the 
development group, although it shares many similar algo- 
rithms with the system studied here. 

It is also true that the requirements for the system 
analyzed kept growing and changing, much more so than for 
the typical ground support software normally built. Due to 
the commonality of algorithms from existing systems, the 
developers re-used the design and code for many algorithms
needed in the new system. Hence a large number of re-used
(modified) modules became part of the new system analyzed here.

An approximation of the analyzed software's life cycle 
is displayed in Figure 1. This figure only illustrates the
approximate duration in time of the various phases of the 
software's life cycle. The information relating the amount 
of manpower involved with each of the phases shown was not 
specific enough to yield meaningful results, so it was not
included.


LIFE CYCLE OF ANALYZED SOFTWARE
Figure 1


1.2 TERMS

This section presents the definitions and associated 
contexts for the terms used within this paper. A discussion 
of the concepts involved with these terms is also given when 
appropriate. 


Module: A module is defined as a named subfunction, subrou-
tine, or the main program of the software system. This
definition is used since only segments written in FORTRAN 
which contained executable code were used for the analyses. 
Change data from the segments which constituted the data 
blocks, assembly segments, common segments, or utility rou- 
tines were not included. However, a general overview of the 
data available on these types of segments is presented in 
Section 4.0 for completeness. 

There are two types of modules referred to within this 
paper. The first type is denoted as modified. These are
modules which were developed for previous software projects
and then modified to meet the requirements of the new pro-
ject. The second type is referred to as new. These are
modules which were developed specifically for the software
project under analysis.

The entire software project contained a total of 517 
code segments. This quantity is comprised of 36 assembly 
segments, 370 FORTRAN segments, and 111 segments that were 
either common modules, block data, or utility routines. The 
number of code segments which met the adopted module defini-
tion was 370 out of 517, which is 72% of the total segments
and constitutes the majority of the software project. Of
the modules found to contain errors, 49% were categorized as
modified and 51% as new modules.


Number of Source and Executable Lines: The number of source
lines within a module refers to the number of lines of exe-
cutable code and comment lines contained within it. The
number of executable lines within a module refers to the
number of executable statements; comment lines are not
included.

Some of the relationships presented in this paper are 
based on a grouping of modules by module size in increments 
of 50 lines. This means that a module containing 50 lines 
of code or less was placed in the module size of 50; modules 
between 51 and 100 lines of code into the module size of 
100, etc. The number of modules which were contained in 
each module size is given in Table 1 for all modules and for 
modules which contained errors (i.e., a subset of all 
modules) with respect to source and executable lines of
code.

                        Number of modules

                    All Modules            Modules with Errors
Number of Lines     Source   Executable    Source   Executable

0-50                   53       258           3        49
51-100                107        70          16        25
101-150                80        26          20        13
151-200                56        13          19         7
201-250                34         1          12         1
251-300                14         1           9         0
301-350                 7         1           4         1
351-400                 9         0           7         0
>400                   10         0           6         0

Total                 370       370          96        96

                             Table 1
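
The grouping by module size described above amounts to rounding a line count up to the next multiple of 50, with everything above 400 lines collected in a final class as in Table 1. A short Python sketch follows; the function name and the sample sizes are illustrative only.

```python
from collections import Counter

def size_class(lines, width=50, cap=400):
    """Return the Table 1 size class label for a module's line count."""
    if lines > cap:
        return f">{cap}"
    # Ceiling division: 1..50 -> 50, 51..100 -> 100, and so on.
    # A module with 0 lines is placed in the first class.
    upper = max(1, -(-lines // width)) * width
    lower = 0 if upper == width else upper - width + 1
    return f"{lower}-{upper}"

# Hypothetical module sizes (lines of code), not data from the study.
sizes = [12, 50, 51, 100, 230, 401, 975]
histogram = Counter(size_class(n) for n in sizes)
```

Applying the same function to source line counts and to executable line counts would give the two column pairs of Table 1.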


Error: Something detected within the executable code which
caused the module in which it occurred to perform
incorrectly (i.e., contrary to its expected function).

Errors were quantified from two viewpoints in this
paper, depending upon the goals of the analysis of the error
data. The first quantification was based on a textual rather
than a conceptual viewpoint. This type of error quantifica-
tion is best illustrated by an example. If a "*" was
incorrectly used in place of a "+" then every occurrence of
the "*" is considered an error. This is the situation
even if the "*"'s appear on the same line of code or within
multiple modules. The total number of errors detected in
the 370 software modules analyzed was 215 contained within a
total of 96 modules, implying 26% of the modules analyzed
contained errors.

The second type of quantification was used to measure
the effect of an error across modules. Textual errors asso-
ciated with the same conceptual problem were combined to
yield one conceptual error. Thus in the example above, all
incorrectly used *'s replaced by +'s in the same formula
were combined and the total number of modules affected by
that error is listed. This is done only for the errors
reported in Figure 2. There are a total of 155 conceptual
errors. All other studies in this paper are based upon the
first type of quantification described.
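
The difference between the two quantifications can be illustrated with a small sketch. The record layout, the problem identifiers, and the module names are invented for illustration; the study itself worked from change report forms.

```python
from collections import defaultdict

# Hypothetical records: (conceptual problem id, module name), one per
# textual error. Module names are placeholders.
textual_errors = [
    ("wrong-operator-in-formula", "MODA"),
    ("wrong-operator-in-formula", "MODA"),
    ("wrong-operator-in-formula", "MODB"),
    ("bad-array-bound", "MODC"),
]

def conceptual_errors(records):
    """Map each conceptual error to the set of modules it affected."""
    modules = defaultdict(set)
    for concept, module in records:
        modules[concept].add(module)
    return dict(modules)

grouped = conceptual_errors(textual_errors)
textual_count = len(textual_errors)   # first quantification
conceptual_count = len(grouped)       # second quantification
```

In this invented example four textual errors reduce to two conceptual errors, one of which affects two modules.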


Statistical Terms and Methods: All linear regressions of the
data presented within this paper employed as a criterion of 
goodness the least squares principle (i.e., "choose as the 
'best fitting' line that one which minimizes the sum of 
squares of the deviations of the observed values of y from 
those predicted" [1]). 

Pearson's product moment coefficient of correlation was 
used as an index of the strength of the linear relationship 
independent of the respective scales of measurement for y
and x. This index is denoted by the symbol r within this
paper. The measure for the amount of variability in y
accounted for by linear regression on x is denoted as r^2
within this paper.

All of the equations and explanations for these statis-
tics can be found in [1]. It should be noted that other
types of curve fits were conducted on the data. The results 
of these fits will be mentioned later in the paper. 
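
For readers who want to reproduce these statistics, the standard formulas for the least squares line and Pearson's coefficient can be written as below. The sample points are made up and are not data from the study.

```python
from math import sqrt

def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products about the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson product moment coefficient
    return slope, intercept, r

# Made-up points, not data from the study.
slope, intercept, r = least_squares([1, 2, 3, 4], [2, 4, 5, 9])
r_squared = r ** 2  # share of the variability in y accounted for by x
```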


Now that the software's environment and the key terms 
used within the paper have been defined and outlined, a dis- 
cussion of the basic quantification of the data collected, 
the relationships and distributions derived from this quan- 
tification, and the resulting conclusions are presented. 


2.0 BASIC DATA 


The change data analyzed was collected over a period of
33 months, August 1977 through May 1980. These dates
correspond in time to the software phases of coding, test-
ing, acceptance, and maintenance (Figure 1). The data col-
lected for the analyses is not complete since changes are
still being made to the software analyzed. However, it is
felt that enough data was viewed in order to make the con-
clusions drawn from the data significant.

The change data was entered on detailed report sheets
which were completed by the programmer responsible for 
implementing the change. A sample of the change report form 
is given in the Appendix. In general, the form required 
that several short questions be answered by the programmer 
implementing the change. These queries allowed a means to 
document the cause of a change in addition to other charac- 
teristics and effects attributed to the change. The major- 
ity of this information was found useful in the analyses. 
The key information used in the study from the form was: the
date of the change or error discovery, the description of
the change or error, the number of components changed, the
type of change or error, and the effort needed to correct
the error.

It should be mentioned that the particular change 
report form shown in the Appendix was the most current form 
but was not uniformly used over the entire period of this 
study. In actuality there were three different versions of 
the change report form, not all of which required the same 
set of questions to be answered. Therefore, for the data
that was not present on one type of form but could be
inferred, the inferred value was used. An example of such
an inference would be that of determining the error type.
Since the error description was given on all of the forms,
the error type could be inferred with a reasonable degree of
reliability. Data not incorporated into a particular data
set used for an analysis was that data for which this infer-
ence was deemed unreliable. Therefore, the reader should be
alert to the cardinality of the data set used as a basis for
some of the relationships presented in this paper. There
was a total of 231 change report forms examined for the pur-
pose of this paper.

The consistency and partial validity of the forms was 
checked in the following manner. First, the supervisor of 
the project looked over the change report forms and verified 
them (denoted by his or her signature and the date). 
Second, when the data was being reduced for analysis it was 
closely examined for contradictions. It should be noted 
that interviews with the individuals who filled out the 
change forms were not conducted. This was the major differ- 
ence between this work and other error studies performed by 
the Software Engineering Laboratory, where interviews were 
held with the programmers to help clarify questionable data 
(8).


The review of the change data as described above
yielded an interesting result. The errors due to previous
miscorrections proved to be three times as common after the
form review process was performed, i.e., before the review
process they accounted for 2% of the errors and after the
review process they accounted for 6% of the errors. These
recording errors are probably attributable to the fact that
the corrector of an error did not know the cause was due to
a previous fix because the fix occurred several months ear-
lier or was made by a different programmer, etc.


3.0 RELATIONSHIPS DERIVED FROM DATA 


This section presents and discusses relationships derived
from the change data.


3.1 CHANGE DISTRIBUTION BY TYPE

Types of changes to the software can be categorized as
error corrections or modifications (specification changes,
planned enhancements, clarity and optimization improve-
ments). For this project, error corrections accounted for
62% of the changes and modifications 38%. In studies of
other SEL projects, error corrections ranged from 40% to
64% of the changes.


3.2 ERROR DISTRIBUTION BY MODULES 

Figure 2 shows the effects of an error in terms of the 
number of modules that had to be changed. (Note that these 
errors here are counted as conceptual errors.) It was found 
that 89% of the errors could be corrected by changing only 
one module. This is a good argument for the modularity of 
the software. It also shows that there is not a large 
amount of interdependence among the modules with respect to 
an error. 


NUMBER OF MODULES AFFECTED BY AN ERROR
(data set: 211 textual errors, 174 conceptual errors)

#ERRORS        #MODULES AFFECTED

155 (89%)              1
  9                    2
  3                    3
  6                    4
  1                    5

Figure 2
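
The 89% figure can be checked directly from the counts in Figure 2, as in this short sketch.

```python
# Conceptual errors by number of modules affected, as given in Figure 2.
errors_by_modules_affected = {1: 155, 2: 9, 3: 3, 4: 6, 5: 1}

total = sum(errors_by_modules_affected.values())
single_module_share = errors_by_modules_affected[1] / total
print(total, round(100 * single_module_share))  # prints "174 89"
```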


Figure 3 shows the number of errors found per module.
The type of module is shown in addition to the overall total
number of modules found to contain errors.

NUMBER OF ERRORS PER MODULE (data set: 215 errors)

#MODULES     NEW     MODIFIED     #ERRORS/MODULE

   36         17        19               1
   26         13        13               2
   16         10         6               3
   13          7         6               4
    4          1**       3*              5
    1          1**                       7

Figure 3


The largest numbers of errors found in a module were 7 (located in a
single new module) and 5 (located in 3 different modified
modules and 1 new module). The remainder of the errors were
distributed almost equally between the two types of modules.
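
As a consistency check, the rows of Figure 3 can be totaled to recover the 96 modules with errors, the 215 textual errors, and the split between new and modified modules quoted in Section 1.2.

```python
# Rows of Figure 3: errors per module -> (new modules, modified modules)
figure3 = {1: (17, 19), 2: (13, 13), 3: (10, 6), 4: (7, 6), 5: (1, 3), 7: (1, 0)}

new_modules = sum(new for new, _ in figure3.values())
modified_modules = sum(mod for _, mod in figure3.values())
total_modules = new_modules + modified_modules
total_errors = sum(k * (new + mod) for k, (new, mod) in figure3.items())
modified_share = round(100 * modified_modules / total_modules)
```

The totals come to 49 new and 47 modified modules, that is, 49% modified and 51% new.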

The effort associated with correcting an error is
specified on the form as being (1) 1 hour or less, (2) 1
hour to 1 day, (3) 1 day to 3 days, or (4) more than 3 days.
These categories were chosen because it was too difficult to
collect effort data to a finer granularity. To estimate the
effort for any particular error correction, an average time
was used for each category, i.e., assuming an 8 hour day, an
error correction in category (1) was assumed to take 0.5
hours, an error correction in category (2) was assumed to
take 4.5 hours, category (3) 16 hours, and category (4) 32
hours.
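
The effort estimate described above is a lookup from category to representative hours followed by an average. In the sketch below the category hours are those stated in the text; the list of corrections is a hypothetical example.

```python
# Representative hours per effort category, as stated in the text
# (assuming an 8 hour day).
CATEGORY_HOURS = {1: 0.5, 2: 4.5, 3: 16.0, 4: 32.0}

def average_effort(categories):
    """Average estimated hours for a list of effort categories."""
    hours = [CATEGORY_HOURS[c] for c in categories]
    return sum(hours) / len(hours)

# Hypothetical example: four corrections in category 3, four in category 4.
example = average_effort([3, 3, 3, 3, 4, 4, 4, 4])
```

Averaging across categories in this way is how an error type can end up with an effort value that is not itself one of the four category values.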


The types of errors found in the three most error prone
modified modules (* in Figure 3) and the effort needed to
correct them are shown in Table 2. If any type contained
error corrections from more than one error correction
category, the associated effort for them was averaged. The
fact that most modules with errors contained between one
and three errors shows that the total number of
errors that occurred per module was on the average very
small.

The twelve errors contained in the two most error prone
new modules (** in Figure 3) are shown in Table 3 along with
the effort needed to correct them.


                               NUMBER OF ERRORS    AVERAGE EFFORT
                               (15 total)          TO CORRECT

misunderstood or incorrect            8            24 hours
specifications

incorrect design or                   5            16 hours
implementation of a module
component

clerical error                        2            4.5 hours

EFFORT TO CORRECT ERRORS IN THREE MOST ERROR PRONE
MODIFIED MODULES
Table 2


                               NUMBER OF ERRORS    AVERAGE EFFORT
                               (12 total)          TO CORRECT

misunderstood or incorrect            8            32 hours
requirements

incorrect design or                   3            0.5 hours
implementation of a module

clerical error                        1            0.5 hours

EFFORT TO CORRECT ERRORS IN THE TWO MOST ERROR PRONE
NEW MODULES
Table 3


3.3 ERROR DISTRIBUTION BY TYPE

In Figure 4 the distribution of errors is shown by type. It
can be seen that 48% of the errors were attributed to
incorrect or misinterpreted functional specifications or
requirements.

The classification for error used throughout the 
Software Engineering Laboratory is given below. The person 
identifying the error indicates the class for each error. 

A: Requirements incorrect or misinterpreted
B: Functional specification incorrect or misinterpreted
C: Design error involving several components

   1. mistaken assumption about value or structure of
      data

   2. mistake in control logic or computation of an
      expression

D: Error in design or implementation of single component

   1. mistaken assumption about value or structure of
      data

   2. mistake in control logic or computation of an
      expression

E: Misunderstanding of external environment
F: Error in the use of programming language/compiler
G: Clerical error
H: Error due to previous miscorrection of an error


The distribution of these errors by source is plotted 
in Figure 4 with the appropriate subdistribution of new and 
modified errors displayed. This distribution shows the 
majority of errors were the result of the functional specif-
ication being incorrect or misinterpreted. Within this
category, the majority of the errors (24%) involved modified
modules. This is most likely due to the fact that the modules
reused were taken from another system with a different 
application. Thus, even though the basic algorithms were the 
same, the specification was not well enough defined or 
appropriately defined for the modules to be used under 
slightly different circumstances.


SOURCES OF ERROR ON OTHER PROJECTS
Figure 5


The distribution in Figure 4 should be compared with
the distribution for another system developed by the same
organization, shown in Figure 5. Figure 5 represents a
typical ground support software system, and its error
distribution is typical of such systems. It differs from the
distribution for the system discussed in this paper, however,
in that the majority of the errors were involved in the
design of a single component. The reason for the difference
is that in ground support systems the design is well
understood, the developers have had a reasonable amount of
experience with the application, any reused design or code
comes from similar systems, and the requirements tend to be
more stable. An analysis of the two distributions makes the
differences in the development environments clear in a
quantitative way.
The percentage of requirements and specification errors is
consistent with the work of Endres [1]. Endres found that
46% of the errors he viewed involved the misunderstanding of
the functional specifications of a module. Our results are
similar even though Endres' analysis was based on data
derived from a different software project and programming
environment. The software project used in Endres' analysis
contained considerably more lines of code per module, was
written in assembly code, and was within the problem area of
operating systems. However, both of the software systems
Endres analyzed did contain new and modified modules.

Of the errors due to the misunderstanding of a module's
specifications or requirements (48%), 20% involved new
modules while 28% involved modified modules.

Although the existence of modified modules can shrink
the cost of coding, the amount of effort needed to correct
errors in modified modules might outweigh the savings. The
effort graph (Figure 6) supports this viewpoint: 50% of the
total effort required for error correction occurred in
modified modules, and errors requiring one day to more than
three days to correct accounted for 45% of the total effort,
with 27% of this effort attributable to modified modules
within these greater effort classes. Thus, errors occurring
in new modules required less effort to correct than those
occurring in modified modules.
EFFORT GRAPH
Figure 6 


The similarity between Endres' results and those
reported here tends to support the statement that,
independent of the environment and possibly the module size,
the majority of errors detected within software are due to an
inadequate form or interpretation of the specifications. This
seems especially true when the software contains modified
modules.

In general, these observations tend to indicate that
there are disadvantages in modifying a large number of
already existing modules to meet new specifications. The
alternative of developing a new module might be better in
some cases if good specifications do not exist for
the existing modules.


3.4 OVERALL NUMBER OF ERRORS OBSERVED


Figure 7 displays the number of errors observed in both
new and modified modules. The curve representing total
modules (new and modified) is basically bell-shaped. One
interpretation is that up to some point errors are detected
at a relatively steady rate. At this point at least half of
the total "detected-undetected" errors have been observed
and the rate of discovery thereafter decreases. It may also
imply that the maintainers are not adding too many new errors
as the system evolves.

It can be seen, however, that errors occurring in 
modified modules are detected earlier and at a slightly 
higher rate than those of new modules. One hypothesis for 
this is that the majority of the errors observed in modified 
modules are due to the misinterpretation of the functional 
specifications as was mentioned earlier in the paper. 
Errors of this type are more blatant than those of other
types and would therefore be detected both earlier and more
readily. (See next section.)



NUMBER OF ERRORS OCCURRING IN MODULES 
Figure 7 


3.5 ABSTRACT ERROR TYPES

An abstract classification of errors was adopted by the
authors, which classified errors into one of five categories
with respect to a module: (1) initialization; (2) control
structure; (3) interface; (4) data; and (5) computation.
This was done in order to see if there existed recurring
classes of errors present in all modules independent of
size. These error classes are only roughly defined, so
examples of these abstract error types are presented below.
It should be noted that even though the authors were
consistent with the categorization for this project, another
error analyst may have interpreted the categories differently.

Failure to initialize or re-initialize a data structure
properly upon a module's entry/exit would be considered an
initialization error. Errors which caused an "incorrect
path" in a module to be taken were considered control
errors. Such a control error might be a conditional
statement causing control to be passed to an incorrect path.
Interface errors were those which were associated with
structures existing outside the module's local environment
but which the module used. For example, the incorrect
declaration of a COMMON segment or an incorrect subroutine
call would be an interface error. An error in the
declaration of the COMMON segment was considered an interface
error and not an initialization error since the COMMON
segment was used by the module but was not part of its local
environment. Data errors would be those errors which are a
result of the incorrect use of a data structure. Examples of
data errors would be the use of incorrect subscripts for an
array, the use of the wrong variable in an equation, or the
inclusion of an incorrect declaration of a variable local to
the module. Computation errors were those which caused a
computation to erroneously evaluate a variable's value.
These errors could be equations which were incorrect not by
virtue of the incorrect use of a data structure within the
statement but rather by miscalculations. An example of this
error might be the statement A = B + 1 when the statement
really needed was A = B/C + 1.

These five abstract categories basically represent all
activities present in any module. The five categories were
further partitioned into errors of commission and omission.
Errors of commission were those errors present as a result
of an incorrect executable statement. For example, a
commissioned computational error would be A = B * C where the
'*' should have been '+'. In other words, the operator was
present but was incorrect. Errors of omission were those
errors which were a result of forgetting to include some
entity within a module. For example, a computational
omission error might be A = B when the statement should have
read A = B + C. A parameter required for a subroutine call
but not included in the actual call would be an example of
an interface omission error. In both of the above examples
some aspect needed for the correct execution of a module was
forgotten.
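
The two-level scheme just described (five abstract types, each split
into commission and omission) can be written down as a small data
structure. The following is only an illustrative sketch; the class and
field names are hypothetical and do not come from the study, and the
sample records simply restate the examples given in the text.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    """The five abstract error categories named in the text."""
    INITIALIZATION = "initialization"
    CONTROL = "control"
    INTERFACE = "interface"
    DATA = "data"
    COMPUTATION = "computation"


class Mode(Enum):
    COMMISSION = "commission"  # an incorrect executable statement is present
    OMISSION = "omission"      # some needed entity was left out


@dataclass(frozen=True)
class ClassifiedError:
    """One error record (hypothetical layout, for illustration only)."""
    description: str
    error_type: ErrorType
    mode: Mode


# The examples from the text, expressed as records.
EXAMPLES = [
    ClassifiedError("A = B * C written with an incorrect operator",
                    ErrorType.COMPUTATION, Mode.COMMISSION),
    ClassifiedError("A = B written where A = B + C was needed",
                    ErrorType.COMPUTATION, Mode.OMISSION),
    ClassifiedError("required parameter missing from a subroutine call",
                    ErrorType.INTERFACE, Mode.OMISSION),
]


def count_errors(errors, error_type, mode):
    """Count the records that fall in one (type, mode) cell."""
    return sum(1 for e in errors
               if e.error_type is error_type and e.mode is mode)
```

Each (type, mode) pair corresponds to one cell of a five-by-two table,
which is the shape in which such counts are naturally reported.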

The results of this abstract classification scheme as
discussed above are given in Figure 8. Since there were
approximately equal numbers of new (49) and modified (47)
modules viewed in the analysis, the results do not need to
be normalized. Some errors and thereby modules were counted
more than once since it was not possible to associate some
errors with a single abstract error type based on the error
description given on the change report form.


                     commission              omission
                   new    modified        new    modified

initialization       2        9            5        9
control             12        2           16        6
interface           23       31           27        6
data                10       17            1        3
computation         16       21            3        3

                   28%      36%          23%      12%
                       64%                   35%


                                total
                   new    modified    new + modified

initialization       7       18         25   (11%)
control             28        8         36   (16%)
interface           50       37         87   (39%)
data                11       20         31   (14%)
computation         19       24         43   (19%)

                   115      107




ABSTRACT CLASSIFICATION OF ERRORS 
Figure 8 
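
As a cross-check on the figure, the marginal totals can be recomputed
from the twenty cell counts. The short sketch below does only that
arithmetic; the variable and function names are our own, and the cell
values are those printed in Figure 8.

```python
# Cell counts from Figure 8, per category:
# (commission new, commission modified, omission new, omission modified)
FIGURE_8 = {
    "initialization": (2, 9, 5, 9),
    "control": (12, 2, 16, 6),
    "interface": (23, 31, 27, 6),
    "data": (10, 17, 1, 3),
    "computation": (16, 21, 3, 3),
}


def column_totals(table):
    """Sum each of the four columns over all categories."""
    return tuple(sum(column) for column in zip(*table.values()))


def row_totals(table):
    """Per category: (new, modified, new + modified)."""
    return {
        name: (cn + on, cm + om, cn + cm + on + om)
        for name, (cn, cm, on, om) in table.items()
    }


def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


grand_total = sum(column_totals(FIGURE_8))
column_shares = [percent(c, grand_total) for c in column_totals(FIGURE_8)]
```

The grand total is 222 classified errors, the four column shares come
to 28%, 36%, 23%, and 12%, and the row totals reproduce the 115 new and
107 modified entries at the foot of the figure.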


According to Figure 8, interfaces appear to be the
major problem regardless of the module type. Control is more
of a problem in new modules than in modified modules. This
is probably because the algorithms in the old modules had
more test and debug time. On the other hand, initialization
and data are more of a problem in modified modules. These
facts, coupled with the small number of errors of omission
in the modified modules, might imply that the basic
algorithms for the modified modules were correct but needed
some adjustment with respect to data values and
initialization for the application of that algorithm to the
new environment.


3.6 MODULE SIZE AND ERROR OCCURRENCE 
Scatter plots for executable lines per module versus
the number of errors found in the module were plotted. It
was difficult to see any trend within these plots, so the
number of errors/1000 executable lines within each module
size class was calculated (Table 4).


Module Size        Errors/1000 lines

     50                  16.0
    100                  12.6
    150                  12.4
    200                   7.6
   >200                   6.4

ERRORS/1000 EXECUTABLE LINES (INCLUDES ALL MODULES)
Table 4


The number of errors was normalized over 1000 executable
lines of code in order to determine if the number of
detected errors within a module was dependent on module
size. All modules within the software were included, even
those with no errors detected. If the number of errors/1000
executable lines was found to be constant over module size,
this would show independence. An unexpected trend was
observed: Table 4 implies that there is a higher error rate
within smaller sized modules. Since only the executable
lines of code were considered, the larger modules were not
COMMON data files. Also, the larger modules will be shown to
be more complex than smaller modules in the next section.
How, then, could this type of result occur?

The most plausible explanation seems to be that, since
there are a large number of interface errors, these are
spread equally across all modules, and so there is a larger
number of errors/1000 executable statements for smaller
modules. Other tentative explanations for this behavior are:
the majority of the modules examined were small (Table 1),
causing a biased result; larger modules were coded with more
care than smaller modules because of their size; and errors
in smaller modules are more apparent, so there may indeed
still be numerous undetected errors present within the larger
modules, since all the "paths" within the larger modules may
not yet have been fully exercised.
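
The first explanation can be made concrete with a little arithmetic.
If every module carries roughly the same number of interface errors
regardless of its size, the normalized rate must fall as size grows.
The sketch below assumes a purely hypothetical figure of 0.5 such
errors per module; it is not a value measured in this study.

```python
def errors_per_1000_lines(errors, executable_lines):
    """Normalize an error count to errors per 1000 executable lines."""
    return 1000.0 * errors / executable_lines


def rate_if_spread_equally(errors_per_module, module_size):
    """Error rate for one module when every module, whatever its size,
    is assumed to carry the same number of errors."""
    return errors_per_1000_lines(errors_per_module, module_size)


# Hypothetical: 0.5 interface errors per module, three module sizes.
ASSUMED_ERRORS_PER_MODULE = 0.5
rates = {size: rate_if_spread_equally(ASSUMED_ERRORS_PER_MODULE, size)
         for size in (50, 100, 200)}
```

Under this assumption the rate is inversely proportional to module
size, which is the same direction as the trend observed in Table 4.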


3.7 MODULE COMPLEXITY

Cyclomatic complexity [5] (number of decisions + 1) was
correlated with module size. This was done in order to
determine whether or not larger modules were less dense or
complex than smaller modules containing errors. Scatter
plots for executable statements per module versus the
cyclomatic complexity were plotted and again, since it was
difficult to see any trend in the plots, modules were
grouped according to size. The complexity points were
obtained by calculating an average complexity measure for
each module size class. For example, all the modules which
had 50 executable lines of code or less had an average
complexity of 6.0. Table 5 gives the average cyclomatic
complexity for all modules within each of the size categories.
The relationship between complexity and executable lines of
code within a module is shown in Figure 9. As can be seen
from the table, the larger modules were more complex than
smaller modules.


Module Size        Average Cyclomatic Complexity

     50                       6.0
    100                      17.9
    150                      28.1
    200                      52.7
   >200                      60.0

AVERAGE CYCLOMATIC COMPLEXITY FOR ALL MODULES
Table 5
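
The grouping behind Table 5 can be sketched as follows. The complexity
measure is the one stated in the text (number of decisions + 1). The
class boundaries are an assumption on our part: the text only says that
the "50" class holds modules of 50 executable lines or less, and we
take the other classes to be 51-100, 101-150, 151-200, and over 200.
The function names and the input layout are hypothetical.

```python
from collections import defaultdict

# Upper bounds of the size classes; larger modules fall in ">200".
SIZE_BOUNDS = (50, 100, 150, 200)


def cyclomatic_complexity(decisions):
    """Cyclomatic complexity as defined in the text: decisions + 1."""
    return decisions + 1


def size_class(executable_lines):
    """Label of the size class a module falls into (assumed bounds)."""
    for bound in SIZE_BOUNDS:
        if executable_lines <= bound:
            return str(bound)
    return ">200"


def average_complexity_by_size(modules):
    """modules: iterable of (executable_lines, decisions) pairs.
    Returns {size class label: average cyclomatic complexity}."""
    groups = defaultdict(list)
    for lines, decisions in modules:
        groups[size_class(lines)].append(cyclomatic_complexity(decisions))
    return {label: sum(values) / len(values)
            for label, values in groups.items()}
```

For instance, two hypothetical modules of 40 and 50 lines with 4 and 6
decisions both fall in the "50" class and average to a complexity of
6.0.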
MODULE SIZE


Figure 9 


For only those modules containing errors, Table 6 gives
the number of errors/1000 executable statements and the
average cyclomatic complexity. When this data is compared
with Table 5, one can see that the average complexity of
the error prone modules was no greater than the average
complexity of the full set of modules.




Module Size    Average Cyclomatic    Errors/1000
                   Complexity        executable lines

     50                6.2               65.0
    100               19.6               33.3
    150               27.5               24.6
    200               56.7               13.4
   >200               77.5                9.7

COMPLEXITY AND ERROR RATE FOR ERRORED MODULES
Table 6


4.0 DATA NOT EXPLICITLY INCLUDED IN ANALYSES

The 147 modules not included in this study (i.e., 
assembly segments, common segments, utility routines) con- 
tained a total of six errors. These six errors were 
detected within three different segments. One error 
occurred in a modified assembly module and was due to the 
misunderstanding or incorrect statement of the functional 
specifications for the module. The effort needed to correct 
this error was minimal (1 hour or less). 

The other five errors occurred in two separate new data 
segments with the major cause of the errors also being 
related to their specifications. The effort needed to 
correct these errors was on the average from 1 hour to 1 day 
(1 day representing 8 hours). 


5.0 CONCLUSIONS 


The data contained in this paper helps explain and 
characterize the environment in which the software was 
developed. It is clear from the data that this was a new 
application domain in an application with changing require- 
ments. 

Modified and new modules were shown to behave similarly
except in the types of errors prevalent in each and the
amount of effort required to correct an error. Both had a
high percentage of interface errors; however, new modules
had an equal number of errors of omission and commission and
a higher percentage of control errors. Modified modules had
a high percentage of errors of commission and a small
percentage of errors of omission, with a higher percentage of
data and initialization errors. Another difference was that
modified modules appeared to be more susceptible to errors
due to the misunderstanding of the specifications.
Misunderstanding of a module's specifications or
requirements constituted the majority of errors detected. This
duplicates an earlier result of Endres, which implies that
more work needs to be done on the form and content of the
specifications and requirements in order to enable them to
be used across applications more effectively.

There were shown to be some disadvantages to modifying
an existing module for use instead of creating a new module.
Modifying an existing module to meet a similar but different
set of specifications reduces the developmental costs of
that module. However, the disadvantage to this is that
there exist hidden costs. Errors contained in modified
modules were found to require more effort to correct than
those in new modules, although the two classes contained
approximately the same number of errors. The majority of
these errors was due to incorrect or misinterpreted
specifications for a module. Therefore, there is a tradeoff
between minimizing development time and time spent to align
a module to new specifications. However, if better
specifications could be developed, it might reduce the more
expensive errors contained within modified modules. In this
case, the reuse of "old" modules could be more beneficial in
terms of cost and effort since the hidden costs would have
been reduced.

One surprising result was that module size did not
account for error proneness. In fact, quite the contrary:
the larger the module, the less error prone it was.
This was true even though the larger modules were more
complex. Additionally, the error prone modules were no more
complex across size grouping than the error free modules.

In general, investigations of the type presented in
this paper, relating error and other change data to the
software in which they have occurred, are important and
relevant. It is the only method by which our knowledge of
these types of relationships will ever increase and evolve.
ABSTRACT


An effective data collection method for evaluating software development
methodologies and for studying the software development process is
described. The method uses goal-directed data collection to evaluate
methodologies with respect to the claims made for them. Such claims
are used as a basis for defining the goals of the data collection,
establishing a list of questions of interest to be answered by data
analysis, defining a set of data categorization schemes, and designing
a data collection form.

The data to be collected are based on the changes made to the software 
during development, and are obtained when the changes are made. To 
insure accuracy of the data, validation is performed concurrently with 
software development and data collection. Validation is based on 
interviews with those people supplying the data. Results from using 
the methodology show that data validation is a necessary part of change 
data collection. Without it, as much as 50% of the data may be 
erroneous. 

Feasibility of the data collection methodology was demonstrated by
applying it to five different projects in two different environments.
The application showed that the methodology was both feasible and useful.
A Methodology For Collecting Valid Software
Engineering Data

Victor R. Basili
University of Maryland 

David M. Weiss 
Naval Research Laboratory 


I. Introduction 

According to the mythology of computer science, the first computer
program ever written contained an error. Error detection and error
correction are now considered to be the major cost factors in software
development [1,2,3]. Much current and recent research is devoted to
finding ways of preventing software errors. This research includes areas
such as requirements definition [4], automatic and semi-automatic program
generation [5,6], functional specification [7], abstract specification
[8,9,10,11], procedural specification [12], code specification [13,14,15],
verification [16,17,18], coding techniques [19,20,21,22,23,24], error
detection [25], testing [26,27], and language design [16,28,29,30,31].

One result of this research is that techniques claimed to be effective for
preventing errors are in abundance. Unfortunately, there have been few
attempts at experimental verification of such claims. The purpose of this
paper is to show how to obtain valid data that may be used both to learn
more about the software development process and to evaluate software
development methodologies in a production environment. Previous [15] and
companion papers [32] present the data and evaluation results. The
methodology described in this paper was developed as part of studies
conducted by the Naval Research Laboratory and by NASA's Software
Engineering Laboratory [33].

Software Engineering Experimentation 

The course of action in most sciences when faced with a question of opinion
is to obtain experimental verification. Software engineering disputes are not
usually settled that way. Data from experiments exist, but rarely apply to the
question to be settled. There are a number of reasons for this state of affairs.
Probably the two most important are the number of potential confounding
factors involved in software studies and the expense of attempting to do
controlled studies in an industrial environment involving medium or large
scale systems.

Rather than attempting controlled studies, we have devised a method for
conducting accurate causal analyses in production environments. Causal
analyses are efforts to discover the causes of errors and the reasons that
changes are made to software. Such analyses are designed to provide some
insight into the software development and maintenance processes, help
confirm or reject claims made for different methodologies, and lead to better
techniques for prevention, detection, and correction of errors. Relatively few
examples of this kind of study exist in the literature; some examples are
[34, 35, 4, 15, 36].

To provide useful data, a data collection methodology must display certain
attributes. Since much of the data of interest for real projects are collected
during the test phase, complete analysis of the data must await project
completion. Although it is important that data collection and validation
proceed concurrently with development, the final analysis must be done from
a historical viewpoint, after the project ends.

Developers can provide data as they make changes during development. In
a reasonably well-controlled software development environment, documentation
and code are placed under some form of configuration control before being
released for use by others than the author. Changes are defined as alterations
to baselined design, code or documentation.

A key factor in the data gathering process is validation of the data as they
become available. Such validity checks result in corrections to the data that
cannot be captured at later times owing to the nature of human memory [37].
Timeliness of both data collection and data validation is quite important to the
accuracy of the analysis.

Careful validation means that the data to be collected must be carefully
specified, so that those supplying data, those validating data, and those
performing the analyses will have a consistent view of the data collected.
This is especially important for the purposes of those wishing to repeat
studies in both the same and different environments.

Careful specification of the data requires the data collectors to have a clear 
idea of the goals of the study. Specifying goals is itself an important issue, 
since, without goals, one runs the risk of collecting unrelated, meaningless data. 

To obtain insight into the software development process, the data collectors 
need to know the kinds of errors committed and the kinds of changes made. To 
identify troublesome issues, the effort needed to make each change is neces- 
sary. For greatest usefulness, one would like to study projects from software 
production environments involving teams of programmers. 

We may summarize the preceding as the following six criteria:

1. the data must contain information permitting identification of the
types of errors and changes made,

2. the data must include the cost of making changes and correcting
errors,

3. data to be collected must be defined as a result of clear specification
of the goals of the study,

4. data should include studies of projects from production environments,
involving teams of programmers,

5. data analysis should be historical, but data must be collected and
validated concurrently with development,

6. data classification schemes to be used must be carefully specified for
the sake of repeatability of the study in the same and different
environments.
II. Schema For The Investigative Methodology

Our data collection methodology is goal oriented. It starts with a set of
goals to be satisfied, uses these to generate a set of questions to be answered,
and then proceeds step-by-step through the design and implementation of a
data collection and validation mechanism. Analysis of the data yields answers to
the questions of interest, and may also yield a new set of questions. The
procedure relies heavily on an interactive data validation process; those
supplying the data are interviewed for validation purposes concurrently with
the software development process. The methodology has been used in two
different environments to study five software projects developed by groups
with different backgrounds using very different software development
methodologies. In both environments it yielded answers to most questions of
interest and some insight into the development methodologies used.

The projects studied vary widely with respect to factors such as application,
size, development team, methodology, hardware, and support software.
Nonetheless, the same basic data collection methodology was applicable
everywhere. The schema used has six basic steps, listed in the following, with
considerable feedback and iteration occurring at several different places.

1. Establish the goals of the data collection 

We divide goals into two categories: those that may be used to evaluate a
particular software development methodology relative to the claims made for it,
and those that are common to all methodologies to be studied.

As an example, a goal of a particular methodology, such as information
hiding [38], might be to develop software that is easy to change. The
corresponding data collection goal is to evaluate the success of the developers
in meeting this goal, i.e. evaluate the ease with which the software can be
changed. Goals in this category may be of more interest to those who are
involved in developing or testing a particular methodology, and must be
defined cooperatively with them.

A goal that is of interest regardless of the methodology being used is to
characterize changes in ways that permit comparisons across projects and
environments. Such goals may interest software engineers, programmers,
managers, and others more than goals that are specific to the success or failure
of a particular methodology.

Consequences of Omitting Goals 

Without goals, one is likely to obtain data in which either incomplete
patterns or no patterns are discernible. As an example, one goal of an early
study [15] was to characterize errors. During data analysis, it became
desirable to discover the fraction of errors that were the result of changes
made to the software for some reason other than to correct an error.
Unfortunately, none of the goals of the study were related to this type of
change, and there were no such data available.

2. Develop a list of questions of interest 

Once the goals of the study have been established, they may be used to
develop a list of questions to be answered by the study. Questions of interest
define data parameters and categorizations that permit quantitative analysis of
the data. In general, each goal will result in the generation of several different
questions of interest. As an example, if the goal is to characterize changes,
some corresponding questions of interest are: "What is the distribution of
changes according to the reason for the change?", "What is the distribution of
changes across system components?", "What is the distribution of effort to
design changes?"

As a second example, if the goal is to evaluate the ease with which software
can be changed, we may identify questions of interest such as: "Is it clear where
a change has to be made in the software?", "Are changes confined to single
modules?", "What was the average effort involved in making a change?"

Questions of interest form a bridge between subjectively-determined goals
of the study and the quantitative measures to be used in the study. They permit
the investigators to determine the quantities that need to be measured and the
aspects of the goals that can be measured. As an example, if one is attempting
to discover how a design document is being used, one might collect data that
show how the document was being used when the need for a change to it was
discovered. This may be the only aspect of the document's use that is
measurable.

Goals for which questions of interest cannot be formulated and goals that
cannot be satisfied because adequate measures cannot be defined may be
discarded. Once formulated, questions can be evaluated to determine if they
completely cover their associated goals and if they define quantitative measures.
Finally, questions of interest have the desirable property of forcing the
investigators to consider the data analyses to be performed before any data are
collected.
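
The relationship between goals, questions of interest, and measures
can be sketched as a small data structure, together with the screening
rule just described (a goal with no question that yields a quantitative
measure may be discarded). This is only an illustration; the class
names, field names, and the wording of the measures are hypothetical,
and the last goal is invented to show the discard case.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Question:
    text: str
    # The quantity to be measured, or None if no adequate
    # quantitative measure can be defined for the question.
    measure: Optional[str] = None


@dataclass
class Goal:
    name: str
    questions: List[Question] = field(default_factory=list)


def goals_to_keep(goals):
    """Keep only goals having at least one measurable question."""
    return [g for g in goals if any(q.measure for q in g.questions)]


GOALS = [
    Goal("characterize changes", [
        Question("What is the distribution of changes according to "
                 "the reason for the change?",
                 measure="number of changes per reason"),
        Question("What is the distribution of changes across system "
                 "components?",
                 measure="number of changes per component"),
    ]),
    Goal("evaluate the ease with which software can be changed", [
        Question("Are changes confined to single modules?",
                 measure="number of modules touched per change"),
    ]),
    # Invented example of a goal that would be discarded.
    Goal("hypothetical goal with no measurable question", [
        Question("Is the design pleasing?"),
    ]),
]

kept = [g.name for g in goals_to_keep(GOALS)]
```

Writing the questions down in this form before any data are collected
makes it easy to see which goals are not yet backed by a measure.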

Consequences of Omitting Questions Of Interest 

Without questions of interest, there may be no quantitative basis for
satisfying the goals of the study. Data distributions that are needed for
evaluation purposes, such as the distribution of effort involved in making
changes, may have to be constructed in an ad hoc way, and be incomplete or
inaccurate.

3. Establish data categories

Once the questions of interest have been established, categorization
schemes for the changes and errors to be examined may be constructed. Each
question generally induces a categorization scheme. If one question is, "What
was the distribution of changes according to the reason for the change?", one
will want to classify changes according to the reason they are made. A simple
categorization scheme of this sort is error corrections vs. non-error corrections
(hereafter called modifications).

Each of these categories may be further subcategorized according to
reason. As an example, modifications could be subdivided into those
modifications resulting from requirements changes, those resulting from a
change in the development support environment (e.g. compiler change),
planned enhancements, optimizations, and others.

Such a categorization permits characterization of the changes with respect 
to the stability of the development environment, with respect to different kinds 
of development activities, etc. When matched with another categorization such 
as the difficulty of making changes, this scheme also reveals which changes are 
the most difficult to make. 
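
Matching two categorizations amounts to a cross-tabulation. The sketch
below counts changes by (reason, effort) pair and then picks out the
reasons that occur in the highest effort class. The record layout, the
function names, and the sample data are all hypothetical; only the
category labels are taken from the text.

```python
from collections import Counter


def cross_tabulate(changes):
    """changes: iterable of (reason, effort_category) pairs.
    Returns a Counter keyed by (reason, effort_category)."""
    return Counter(changes)


def reasons_in_effort_class(changes, effort_category):
    """Number of changes per reason within one effort class."""
    table = cross_tabulate(changes)
    return Counter({reason: n
                    for (reason, effort), n in table.items()
                    if effort == effort_category})


# Invented sample data, for illustration only.
SAMPLE = [
    ("requirements change", "more than 3 days"),
    ("requirements change", "more than 3 days"),
    ("optimization", "1 hour or less"),
    ("planned enhancement", "more than 3 days"),
    ("optimization", "1 hour to 1 day"),
]

hardest = reasons_in_effort_class(SAMPLE, "more than 3 days")
```

In the invented sample, requirements changes account for two of the
three changes in the highest effort class.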

Each categorization scheme should be complete and consistent, i.e. every
change should fit exactly one of the subcategories of the scheme. To insure
completeness, the category "Other" is usually added as a subcategory. Where
some changes are not suited to the scheme, the subcategory "Not Applicable"
may be used. As an example, if the scheme includes subcategories for different
levels of effort in isolating error causes, then errors for which the cause need
not be isolated (e.g. clerical errors noticed when reading code) belong in the
"Not Applicable" subcategory.
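
The "exactly one subcategory" rule is easy to check mechanically once
the scheme is written down. The sketch below flags any change that was
given no subcategory, more than one, or a label outside the scheme. The
scheme shown follows the modification subcategories mentioned earlier
plus "Other" and "Not Applicable"; the function name, the record
layout, and the sample assignments are hypothetical.

```python
MODIFICATION_SCHEME = {
    "requirements change",
    "development support environment change",
    "planned enhancement",
    "optimization",
    "Other",            # added to make the scheme complete
    "Not Applicable",   # for changes the scheme does not suit
}


def find_misclassified(assignments, scheme):
    """assignments: dict mapping a change identifier to the list of
    subcategories chosen for it. Returns the sorted identifiers of
    changes that do not fit exactly one subcategory of the scheme."""
    problems = []
    for change_id, chosen in assignments.items():
        if len(chosen) != 1 or chosen[0] not in scheme:
            problems.append(change_id)
    return sorted(problems)


# Invented assignments, for illustration only.
ASSIGNMENTS = {
    "change-1": ["optimization"],
    "change-2": ["optimization", "planned enhancement"],  # two labels
    "change-3": [],                                       # no label
    "change-4": ["cosmetic"],                             # not in scheme
    "change-5": ["Other"],
}

misclassified = find_misclassified(ASSIGNMENTS, MODIFICATION_SCHEME)
```

A check of this kind could be run during validation, while the person
who supplied the data can still be asked which subcategory was meant.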

Consequences Of Not Defining Data Categories Before Collecting Data 

Omitting the data categorization schemes may result in data that cannot
later be identified as fitting any particular categorization. Each change then
tends to define its own category, and the result is an overwhelming multiplicity
of data categories, with little data in each category.

4. Design and test data collection form 

To provide a permanent copy of the data and to reinforce the
programmers' memories, a data collection form is used. Form design was one
of the trickiest parts of the studies conducted, primarily because forms
represent a compromise among conflicting objectives. Typical conflicts are
the desire to collect a complete, detailed set of data that may be used to
answer a wide range of questions of interest, and the need to minimize the
time and effort involved in supplying the data. Satisfying the former leads to
large, detailed forms that require much time to fill out. The latter requires a
short form organized so that the person supplying the data need only check
off boxes.

Including the data suppliers in the form design process is quite beneficial.
Complaints by those who must use the form are resolved early (i.e. before data
collection begins), the form may be tailored to the needs of the data suppliers
(e.g. for use in configuration management), and the data suppliers feel they
are a useful part of the data collection process.

The forms must be constructed so that the data they contain can be used to
answer the questions of interest. Several design iterations and test periods are
generally needed before a satisfactory design is found.

Our principal goals in form design were to produce a form that:

1. fit on one piece of paper,

2. could be used in several different programming environments, and 

3. permitted the programmer some flexibility in describing the 
change. 

Figure 1 shows the last version of the form used for the SEL studies. (An 
earlier version of the form was significantly modified as a result of experience 
gained in the data collection and analysis processes.) The first sections of the 
form request textual descriptions of the change and the reason it was made. 
Following sections contain questions and check-off tables that reflect various 
categorization schemes. 

As an example, a categorization of time to design changes is requested in 
the first question following the description of the change. The completer of the 
form is given the choice of 4 categories (one hour or less, one hour to one day, 
one day to three days, and more than three days) that cover all possibilities for
design time. 
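
The four check-off categories amount to a simple threshold lookup. The Python
sketch below is illustrative only: the function name effort_category and the
assumption that one working day equals 8 hours are ours and do not come from
the SEL form.

```python
# Hypothetical helper: map an effort in hours to the four check-off
# categories named in the text.
HOURS_PER_DAY = 8  # assumption: one working day is 8 hours


def effort_category(hours: float) -> str:
    """Return the category that covers the given effort in hours."""
    if hours < 0:
        raise ValueError("effort cannot be negative")
    if hours <= 1:
        return "one hour or less"
    if hours <= HOURS_PER_DAY:
        return "one hour to one day"
    if hours <= 3 * HOURS_PER_DAY:
        return "one day to three days"
    return "more than three days"
```

Because the thresholds are exhaustive and mutually exclusive, every
non-negative effort falls into exactly one category, which is the property the
form designer needs.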

Consequences Of Not Using A Data Collection Form 

Without a data collection form, it is necessary to rely on the developers'
memories and on perusal of early versions of design documentation and code to
identify and categorize the changes made. This approach leads to incomplete, 
inaccurate data.

CHANGE REPORT FORM

NUMBER ________     PROJECT NAME ________     CURRENT DATE ________

SECTION A - IDENTIFICATION

REASON: Why was the change made?

DESCRIPTION: What change was made?

EFFECT: What components (or documents) are changed? (Include version)

EFFORT: What additional components (or documents) were examined in determining
what change was needed?

                                     (Month  Day  Year)
Need for change determined on        ________
Change started on                    ________

What was the effort in person time required to understand and implement the
change?

___ 1 hour or less   ___ 1 hour to 1 day   ___ 1 day to 3 days   ___ more than 3 days

SECTION B - TYPE OF CHANGE (How is this change best characterized?)

[ ] Error correction
[ ] Planned enhancement
[ ] Implementation of requirements change
[ ] Improvement of clarity, maintainability, or documentation
[ ] Improvement of user services
[ ] Insertion/deletion of debug code
[ ] Optimization of time/space/accuracy
[ ] Adaptation to environment change
[ ] Other (Explain in E)

Was more than one component affected by the change?   ___ Yes   ___ No

FOR ERROR CORRECTIONS ONLY

SECTION C - TYPE OF ERROR (How is this error best characterized?)

[ ] Requirements incorrect or misinterpreted
[ ] Functional specifications incorrect or misinterpreted
[ ] Design error, involving several components
[ ] Error in the design or implementation of a single component
[ ] Misunderstanding of external environment, except language
[ ] Error in use of programming language/compiler
[ ] Clerical error
[ ] Other (Explain in E)

FOR DESIGN OR IMPLEMENTATION ERRORS ONLY

If the error was in design or implementation:

[ ] The error was a mistaken assumption about the value or structure of data
[ ] The error was a mistake in control logic or computation of an

Figure 1 SEL Change Report Form (front)

FOR ERROR CORRECTIONS ONLY

SECTION D - VALIDATION AND REPAIR

What activities were used to validate the program, detect the error, and find
its cause?

                                  Activities   Activities      Activities   Activities
                                  Used for     Successful in   Tried to     Successful
                                  Program      Detecting       Find         in Finding
                                  Validation   Error Symptoms  Cause        Cause

Pre-acceptance test runs          ____         ____            ____         ____
Acceptance testing                ____         ____            ____         ____
Post-acceptance use               ____         ____            ____         ____
Inspection of output              ____         ____            ____         ____
Talks with other programmers      ____         ____            ____         ____
Special debug code                ____         ____            ____         ____
System error messages             ____         ____            ____         ____
Project specific error messages   ____         ____            ____         ____
Reading documentation             ____         ____            ____         ____
Trace                             ____         ____            ____         ____
Dump                              ____         ____            ____         ____
Cross-reference/attribute list    ____         ____            ____         ____
Proof technique                   ____         ____            ____         ____

What was the time used to isolate the cause?

___ one hour or less   ___ one hour to one day   ___ more than one day   ___ never found

If never found, was a workaround used?   ___ Yes   ___ No (Explain in E)

Was this error related to a previous change?

___ Yes (Change Report #/Date ________)   ___ No   ___ Can't tell

When did the error enter the system?

___ requirements   ___ functional specs   ___ design   ___ coding and test
___ other          ___ can't tell

SECTION E - ADDITIONAL INFORMATION

Please give any information that may be helpful in categorizing the error or
change, and understanding its cause and its ramifications.

5. Collect and validate data

Data are collected by requiring those people who are making software
changes to complete a change report form for each change made, as soon as the
change is completed. Validation consists of checking the forms for correctness,
consistency, and completeness. As part of the validation process, in cases
where such checks reveal problems, the people who filled out the forms are
interviewed. Both collection and validation are concurrent with software 
development; the shorter the lag between programmers completing forms and 
being interviewed concerning those forms, the more accurate the data. 

Perhaps the most significant problem during data collection and validation 
is ensuring that the data are complete, i.e. that every change has been described
on a form. The better controlled the development process, the easier this is to 
do. At each stage of the process where configuration control is imposed, change 
data may be collected. Where projects that we have studied use formal 
configuration control, we have integrated the configuration control procedures 
and the data collection procedures, using the same forms for both, and taking 
advantage of configuration control procedures for validation purposes. Since all 
changes must be reviewed by a configuration control board in such cases, we are 
guaranteed capture of all changes, i.e. that our data are complete. Furthermore,
the data collection overhead is absorbed into the configuration control
overhead, and is not visible as a separate source of irritation to the developers. 

Consequences Of Omitting Validation 

One result of concurrent development, data collection, and data validation
is that the accuracy of the collection process may be quantified. Accuracy may 
be calculated by observing the number of mistakes made in completing data col- 
lection forms. One may then compare, for any data category, pre-validation dis- 
tributions with post-validation distributions. We call such an analysis a valida- 
tion analysis. The validation analysis of the SEL data shows that it is possible for 
inaccuracies on the order of 50% to be introduced by omitting validation. To 
emphasize the consequences of omitting the validation procedures, we present 
some of the results of the validation analysis of the SEL data in section III.
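
A validation analysis of this kind can be sketched in a few lines of Python.
The sketch is illustrative only: the function names, the sample labels, and the
assumption that each form carries a single category before and after validation
(ignoring generated and combined forms) are ours, not part of the SEL
procedures.

```python
from collections import Counter


def distribution(labels):
    """Return {category: fraction of forms} for a list of category labels."""
    counts = Counter(labels)
    total = len(labels)
    return {category: n / total for category, n in counts.items()}


def validation_analysis(pre, post):
    """Compare pre- and post-validation categories for the same forms.

    pre[i] and post[i] are the categories recorded for form i before
    and after validation.
    """
    if not pre or len(pre) != len(post):
        raise ValueError("pre and post must describe the same non-empty set of forms")
    corrected = sum(1 for before, after in zip(pre, post) if before != after)
    return {
        "pre": distribution(pre),
        "post": distribution(post),
        "fraction_corrected": corrected / len(pre),
    }


# Made-up labels for five forms (illustration only, not SEL data).
pre = ["clerical", "clerical", "design", "clerical", "language"]
post = ["clerical", "design", "design", "clerical", "language"]
result = validation_analysis(pre, post)
```

Comparing result["pre"] with result["post"] for one category shows how far the
unvalidated distribution would have been off for that category.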

6. Analyze Data 

Data are analyzed by calculating the parameters and distributions needed
to answer the questions of interest. As an example, to answer the question
"What was the distribution of changes according to the reason for the change?",
a distribution such as that shown in figure 2 might be computed from the data. 
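
Computing such a distribution is a counting exercise. The following Python
sketch assumes each validated form has been reduced to a single category
string; the function name and the sample records are hypothetical, and the
category names are borrowed from Section B of the change report form.

```python
from collections import Counter


def change_distribution(change_types):
    """Return the percentage of changes that fall in each category."""
    counts = Counter(change_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders categories from most to least frequent.
    return {category: 100.0 * n / total for category, n in counts.most_common()}


# Hypothetical records (not SEL data).
changes = [
    "Error correction",
    "Planned enhancement",
    "Error correction",
    "Implementation of requirements change",
]
for category, percent in change_distribution(changes).items():
    print(f"{category:40s} {percent:5.1f}%")
```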

Application of the Schema

Applying the schema requires iterating among the steps several times. 
Defining the goals and establishing the questions of interest are tightly coupled, 
as are establishing the questions of interest, designing and testing the form(s), 
and collecting and validating the data. Many of the considerations involved in 
implementing and integrating the steps of the schema have been omitted here 
so that the reader may have an overview of the process. The complete set of 
goals, questions of interest, and data categorizations for the SEL projects are 
shown in a companion paper [32].

Support Procedures and Facilities

In addition to the activities directly involved in the data collection effort, 
there are a number of support activities and facilities required. Included as 
support activities are testing the forms, collection, and validation procedures, 
training the programmers, selecting a data base system to permit easy analysis
of the data, encoding and entering data into the data base, and developing
analysis programs.

III. Details Of SEL Data Collection And Validation

In the SEL environment, program libraries were used to support and control
software development. There was a full-time librarian assigned to support SEL 
projects. All project library changes were routed through the librarian. In gen- 
eral, we define a change to be an alteration to baselined design, code, or docu- 
mentation. For SEL purposes, only changes to code, and documentation con- 
tained in the code, were studied. The program libraries provided a convenient 
mechanism for identifying changes.

Each time a programmer caused a library change, he was required to complete
a change report form (figure 1). The data presented here are drawn from
studies of three different SEL projects, denoted SEL1, SEL2, and SEL3. The
processing procedures were as follows.

1. Programmers were required to complete change report forms for all 
changes made to library routines. 

2. Programs were kept in the project library during the entire test phase. 

3. After a change was made, a completed change report form describing
the change was submitted. The form was first informally reviewed by 
the project leader. It was then sent to the SEL library staff to be 
logged and a unique identifier assigned to it. 

4. The change analyst reviewed the form and noted any inconsistencies, 
omissions, or possible miscategorizations. Any questions the analyst 
had were resolved in an interview with the programmer. (Occasionally 
the project leader or system designer was consulted rather than the 
individual programmer.) 

5. The change analyst revised the form as indicated by the results of the 
programmer interview, and returned it to the library staff for further 
processing. Revisions often involved cases where several changes were
reported on one form. In these cases, the analyst ensured that there
was only one change reported per form; this often involved filling out
new forms. Forms created in this way are known as generated forms.

(Changes were considered to be different if they were made for 
different reasons, if they were the result of different events, or if they 
were made at substantially different times (e.g. several weeks apart). 
As an example, two different requirements amendments would result in 
two different change reports, even if the changes were made at the 
same time in the same subroutine.)

Occasionally, one change was reported on several different forms. The
forms were then merged into one form, again to ensure one and only
one change per form. Forms created in this way are known as combined
forms.

6. The library staff encoded the form for entry into the (automated) SEL
data base. A preliminary, automated check of the form was made via a 
set of data base support programs. This check, mostly syntactic, 
ensured that the proper kinds of values were encoded into the proper 
fields, e.g. that an alphabetic character was not entered where an 
integer was required. 

7. The encoded data were entered into the SEL data base. 

8. The data were analyzed by a set of programs that computed the necessary
distributions to answer the questions of interest.
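
The mostly syntactic check described in step 6 can be pictured as follows. This
Python sketch is not the SEL data base support software: the field names, the
record layout (a dictionary of strings), and the function name are assumptions
made purely for illustration.

```python
# Hypothetical field layout; the actual SEL encoding is not given in the text.
FIELD_TYPES = {
    "form_number": int,
    "project": str,
    "change_type": str,
    "components_changed": int,
}


def syntax_errors(record):
    """Return a list of problems found in one encoded form."""
    problems = []
    for field, expected in FIELD_TYPES.items():
        value = record.get(field)
        if value is None or value == "":
            problems.append(f"{field}: missing")
        elif expected is int:
            try:
                int(value)
            except ValueError:
                # e.g. an alphabetic character where an integer is required
                problems.append(f"{field}: expected an integer, got {value!r}")
    return problems


bad = {"form_number": "12", "project": "SEL1",
       "change_type": "E", "components_changed": "A"}
print(syntax_errors(bad))
```

A check of this kind catches encoding slips only; it says nothing about whether
a syntactically valid category is the right one, which is why the analyst
review and programmer interview remain necessary.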

Many of the reported SEL changes were error corrections. We define an 
error to be a discrepancy between a specification and its implementation. 
Although it was not always possible to identify the exact location of an error, it 
was always possible to identify exactly each error correction. As a result, we 
generally use the term error to mean error correction. 

For data validation purposes, the most important parts of the data collection
procedure are the review by the change analyst and the associated programmer
interview to resolve uncertainties about the data.

The SEL validation procedures afforded a good chance to discover whether
validation was really necessary; it was possible to count the number of mis- 
categorizations of changes and associated misinformation. These counts were 
obtained by counting the number of times each question on the form was 
incorrectly answered. 

An example is misclassification of errors as clerical errors. (Clerical errors
were defined as errors that occur in the mechanical translation of an item from
one format to another, e.g. from one coding sheet to another, or from one
medium to another, e.g. coding sheets to cards.) For one of the SEL projects, 46
errors originally classified as clerical were actually errors of other types. (One
of these consisted of the programmer forgetting to include several lines of code
in a subroutine. Rather than clerical, this was classified as an error in the
design or implementation of a single component of the system.) Initially, this 
project reported 230 changes, so we may say that about 19% of the original 
reports were misclassified as clerical errors. 

The SEL validation process was not good for verifying the completeness of
the reported data. We cannot tell from the validation studies how many changes 
were never reported. This weakness can be eliminated by integrating the data 
collection with stronger configuration control procedures. 

Validation Differences Among SEL Projects 

As experience was gained in collecting, validating, and analyzing data for
the SEL projects, the quality of the data improved significantly, and the
validation procedures changed slightly. For SEL1 and SEL2, completed forms were
examined and programmers interviewed by a change analyst within a few weeks
(typically 3 to 6 weeks) of the time the forms were completed. For project SEL2,
the task leader (lead programmer for the project) examined each form before
the change analysts saw it.

Project SEL3 was not monitored as closely as SEL1 and SEL2. The task
leader, who was the same as for SEL2, by then understood the data categorization
schemes quite well and again examined the forms before sending them to
the SEL. The forms themselves were redesigned to be simpler but still capture 
nearly all the same data. Finally, several of the programmers were the same as 
on project SEL2 and were experienced in completing the forms. 

Estimating Inaccuracies In The Data

Although there is no completely objective way to quantify the inaccuracy in 
the validated data, we believe it to be no more than 5% for SEL1 and SEL2. By
this we mean that no more than 5% of the changes and errors are misclassified
in any of the data collection categories. For the major categories, such as 
whether a change is an error or modification, the type of change, and the type of 
error, the inaccuracy is probably no more than 3%. 

For SEL3, we attempted to quantify the results of the validation procedures 
more carefully. After validation, forms were categorized according to our
confidence in their accuracy. We used four categories: 

(1) Those forms for which we had no doubt concerning the accuracy of
the data. Forms in this category were estimated to have no more
than a 1% chance of inaccuracy.

(2) Those forms for which there was little doubt about the accuracy of
the data. Forms in this category were estimated to have at most a
10% chance of an inaccuracy.

(3) Those forms for which there was some uncertainty about the accuracy,
with an estimated inaccuracy rate of no more than 30%.

(4) Those forms for which there was considerable uncertainty about the
accuracy, with an estimated inaccuracy rate of about 50%.

Applying the inaccuracy rates to the number of forms in each category gave us
an estimated inaccuracy of at most 3% in the validated forms for SEL3.
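
The estimate is a weighted average of the four category rates, weighted by the
number of forms in each category. The Python sketch below shows the arithmetic
only. The rates are those quoted above; the per-category form counts are
hypothetical, since the actual SEL3 counts are not given here, and the function
name is our own.

```python
# Inaccuracy rate assumed for each confidence category (from the text:
# upper bounds for categories 1 to 3, "about 50%" for category 4).
RATES = {1: 0.01, 2: 0.10, 3: 0.30, 4: 0.50}


def estimated_inaccuracy(forms_per_category):
    """Weighted average of the category rates, weighted by form counts."""
    total = sum(forms_per_category.values())
    if total == 0:
        raise ValueError("no forms to evaluate")
    expected_bad = sum(RATES[cat] * n for cat, n in forms_per_category.items())
    return expected_bad / total


# Hypothetical counts, chosen only to illustrate the calculation.
counts = {1: 180, 2: 15, 3: 4, 4: 1}
print(f"{estimated_inaccuracy(counts):.1%}")
```

With these made-up counts the expected number of inaccurate forms is
1.8 + 1.5 + 1.2 + 0.5 = 5.0 out of 200, or about 2.5%. The result stays low
only as long as most forms fall in the first category.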

Prevalent Mistakes In Completing Forms 

Clear patterns of mistakes and misclassifications in completing forms
became evident during validation. As an example, programmers on projects
SEL1 and SEL2 frequently included more than one change on one form. Often
this was a result of the programmers sending the changes to the library as a
group. 

Comparative Validation Results 

Figure 3 provides an overview of the results of the validation process for the 
3 SEL projects. The percentage of original forms that had to be corrected as a
result of the validation process is shown. As an example, 32% of the originally 
completed change report forms for SEL3 were corrected as a result of valida- 
tion. The percentages are based on the number of original forms reported 
(since some forms were generated, and some combined, the number of changes 
reported after validation is different than the number reported before valida- 
tion). Figure 4 shows the fraction of generated forms expressed as a percentage 
of total validated forms. 
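
The two figures use different denominators, which is easy to overlook. The
Python sketch below makes the distinction explicit. It is an illustration under
stated assumptions: the function and parameter names are hypothetical, the
counts are made up, and the number of validated forms is modeled as original
forms plus generated forms minus forms eliminated by combining.

```python
def validation_percentages(original, corrected, generated, combined_away):
    """Return (percent of original forms corrected,
               percent of validated forms that were generated).

    original       forms submitted by programmers
    corrected      original forms changed as a result of validation
    generated      new forms created by splitting multi-change forms
    combined_away  forms eliminated by merging duplicate reports
    """
    validated = original + generated - combined_away  # modeling assumption
    pct_corrected = 100.0 * corrected / original      # basis: original forms
    pct_generated = 100.0 * generated / validated     # basis: validated forms
    return pct_corrected, pct_generated


# Hypothetical counts (not SEL data).
pct_corrected, pct_generated = validation_percentages(
    original=200, corrected=50, generated=30, combined_away=5)
```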

Figure 3 shows that pre-validation SEL3 forms were significantly more accurate
than the pre-validation SEL1 or SEL2 forms. When the generated and combined
forms are also considered, the pre-validation SEL3 data appear to be
considerably better than the pre-validation data for either of the other
projects. We believe the reasons for this are the improved design of the form
and the familiarity of the task leader and programmers with the data collection
process.
(Generated forms are shown in figure 4. Combined forms for all of the projects 
represented a very small fraction of the total validated forms.) 

These (overall) results show that careful validation, including programmer
interviews, is essential to the accuracy of any study involving change data.
Furthermore, it appears that with well-designed forms and programmer training,
there is improvement with time in the accuracy of the data one can obtain.
We do not believe that it will ever be possible to dispense entirely with program- 
mer interviews, however. 

Erroneous Classifications 

Table 1 shows misclassifications of errors as modifications and modifications
as errors. As an example, for SEL1, 14% of the original forms were classified as
modifications, but were actually errors. Without the validation process,
considerable inaccuracy would have been introduced into the initial
categorization of changes as modifications or errors.

Table 2 is a sampling of other kinds of classification errors that could
contribute significantly to inaccuracy in the data. All involve classification
of an error into the wrong subcategory. The first row shows errors that were
classified by the programmer as clerical, but were later reclassified as a
result of the validation process into another category. For SEL1, significant
inaccuracy (19%) would be introduced by omitting the validation process.

Table 3 is similar to table 2, but shows misclassifications involving
modifications. The first row shows modifications that were classified by the
programmer as requirements or specifications changes, but were reclassified as a
result of validation. 

Variation In Misclassification

Data on misclassifications of change and error type subcategories, such as
shown in table 2, tend to vary considerably among both projects and
subcategories. (Misclassification of clerical errors as shown in table 2 is a good
example.) This is most likely because the misclassifications represent biases in 
the judgements of the programmers. It became clear during the validation pro- 
cess that certain programmers tended toward particular misclassifications. 

The consistency between projects SEL2 and SEL3 in table 2 probably occurs 
because both projects had the same task leader, who screened all forms before 
sending them to the SEL for validation.

Conclusions Concerning Validation

The preceding sections have shown that the validation process, particularly
the programmer interviews, is a necessary part of the data collection
methodology. Inaccuracies on the order of 50% may be introduced without this
form of validation. Furthermore, it appears that with appropriate form design
and programmer experience in completing forms, the inaccuracy rate may be
substantially reduced, although it is doubtful that it can be reduced to the
level where programmer interviews may be omitted from the validation procedures.

A second significant conclusion is that the analysis performed as part of the
validation process may be used to guide the data collection project; the analysis
results show what data can be reliably and practically collected, and what data
cannot be. Data collection goals, questions of interest, and data collection forms
may have to be revised accordingly. 

IV. Recommendations For Data Collectors

We believe we now have sufficient experience with change data collection to
be able to apply it successfully in a wide variety of environments. Although we
have been able to make comparisons between the data collected in the two
environments we have studied, we would like to make comparisons with a wider
variety of environments. Such comparisons will only be possible if more data
become available. To encourage the establishment of more data collection
projects, we feel it is important to describe a successful data collection
methodology, as we have done in the preceding sections, to point out the
pitfalls involved, and to suggest ways of avoiding those pitfalls.

Procedural Lessons Learned 

Problems encountered in various procedural aspects of the studies were 
the most difficult to overcome. Perhaps the most important are the following. 

1. Clearly understanding the working environment and specifying the 
data collection procedures were a key part of conducting the investiga- 
tion. Misunderstanding by the programmer of the circumstances that 
require him/her to file a change report form will prejudice the entire 
effort. Prevention of such misunderstandings can partly be accomplished
by training procedures and good forms design, but feedback to
the development staff, i.e. those filling out the data collection forms, 
must not be omitted. 

2. Similarly, misunderstanding by the change analyst of the circumstances
that required a change to be made will result in
misclassifications and erroneous analyses. Our SEL data collection was 
helped by the use of a change analyst who had previously worked in the 
NASA environment and understood the application and the develop- 
ment procedures used. 

3. Timely data validation through interviews with those responsible for 
reporting errors and changes was vital, especially during the first few
projects to use the forms. Without such validation procedures, data 
will be severely biased, and the developers will not get the feedback to 
correct the procedures they are using for reporting data. 

4. Minimizing the overhead imposed on the people who were required to
complete change reports was an important factor in obtaining complete
and accurate data. Increased overhead brought increased reluctance
to supply and discuss data. In projects where data collection has
been integrated with configuration control, the visible data collection
and validation overhead is significantly decreased, and is no longer an
important factor in obtaining complete data. Because configuration
control procedures for the SEL environment were informal, we believe
we did not capture all SEL changes.

5. In cases where an automated data base is used, data consistency and 
accuracy checks at or immediately prior to analysis are vital. Errors in
encoding data for entry into the data base will otherwise bias the data. 

Nonprocedural Lessons Learned 

In addition to the procedural problems involved in designing and implementing
a data collection study, we found several other pitfalls that could have
strongly affected our results and their interpretation. They are listed in the fol- 
lowing. 

1. Perhaps the most significant of these pitfalls was the danger of inter- 
preting the results without attempting to understand factors in the 
environment that might affect the data. As an example, we found a 
surprisingly small percentage of interface errors on all of the SEL pro- 
jects. This result was surprising since interfaces are an often-cited 
source of errors. There was also other evidence in the data that the
software was quite amenable to change. In trying to understand these
results, we discussed them with the principal designer of the SEL projects
(all of which had the same application). It was clear from the discussion
that as a result of their experience with the application, the
designers had learned what changes to expect to their systems, organ- 
ized the design so that the expected changes would be easy to make, 
and then re-used the design from one project to the next. Rather than 
misinterpreting the data to mean that interfaces were not a significant 
software problem, we were led to a better understanding of the
environment we were studying. 

2. A second pitfall was underestimating the resources needed to validate 
and analyze the data. Understanding the change reports well enough
to conduct meaningful, efficient programmer interviews for validation
purposes initially consumed considerable amounts of the change 
analysts' time. Verifying that the data base was internally consistent, 
complete, and consistent with the paper copies of reports was a con- 
tinuing source of frustration and sink for time and effort. 

3. A third potential pitfall in data collection is the sensitivity of the data. 
Programmers and designers sometimes need to be convinced that
error data will not be used against them. This did not seem to be a 
significant problem on the projects studied for a variety of reasons, 
including management support, processing of the error data by people 
independent of the project, identifying error reports in the analysis 
process by number rather than name, informing newly hired project
personnel that completion of error reports was considered part of 
their job, and high project morale. Furthermore, project management 
did not need error data to evaluate performance. 

4. One problem for which there is no simple solution is the Hawthorne (or 
observer) effect [39]. When project personnel become aware that an 
aspect of their behavior is being monitored, their behavior will change. 
If error monitoring is a continuous, long-term activity that is part of 
the normal scheme of software development, not associated with 
evaluation of programmer performance, this effect may become 
insignificant. We believe this was the case with the projects studied.

5. The sensitivity of error data is enhanced in an environment where
development is done on contract. Contractors may feel that such data 
are proprietary. Rules for data collection may have to be contractually 
specified. 

Avoiding Data Collection Pitfalls 

In the foregoing sections a number of potential pitfalls in the data collec- 
tion process have been described. The following list includes suggestions that 
help avoid some of these pitfalls. 

1. Select change analysts who are familiar with the environment, application,
project, and development team.

2. Establish the goals of the data collection methodology and define the
questions of interest before attempting any data collection. Establishing
goals and defining questions should be an iterative process performed
in concert with the developers. The developers’ interests are
then served as well as the data collector’s. 

3. For initial data collection efforts, keep the set of data collection goals 
small. Both the volume of data and the time consumed in gathering,
validating, and analyzing it will be unexpectedly large. 

4. Design the data collection form so that it may be used for configuration 
control, so that it is tailored to the project(s) being studied, so that the
data may be used for comparison purposes, and so that those filling
out the forms understand the terminology used. Conduct training sessions
in filling out forms for newcomers.

5. Integrate data collection and validation procedures into the 
configuration control process. Data completeness and accuracy are
thereby improved, data collection is unobtrusive, and collection and
validation become a part of the normal development procedures. In
cases where configuration control is not used or is informal, allocate
considerable time to programmer interviews, and, if possible, documentation
search and code reading.

6. Automate as much of the data analysis process as possible.

Limitations

It has been previously noted that the main limitation of using a goal- 
directed data collection approach in a production software environment is the 
inability to isolate the effects of single factors. For a variety of reasons, con- 
trolled experiments that may be used to test hypotheses concerning the effects 
of single factors do not seem practical. Neither can one expect to use the 
change data from goal-directed data collection to test such hypotheses.

A second major limitation is that lost data cannot be accurately recap- 
tured. The data collected as a result of these studies represent five years of 
data collection. During that time there was considerable and continuing con- 
sideration given to the appropriate goals and questions of interest. Nonetheless, 
as data were analyzed, it became clear that there was information that was
never requested but that would have been useful. An example is the length of 
time each error remained in the system. Programmers correcting their own 
errors, which was the usual case, can supply this data easily at the time they 
correct the error. Our attempts to discover error entry and removal times after
the end of development were fruitless. (Error entry times were particularly
difficult to discover.) Given such data, one could isolate errors that were not
easily susceptible to detection. This type of example underscores the need for
careful planning prior to the start of data collection.


Recommendations That May Be Provided To The Software Developer

The nature of the data collection methodology and the environments in
which it can be used do not generally permit isolation of the effects of particular 
factors on the software development process. The results cannot be used to 
suggest that controlling a particular factor in the development process will 
reduce the quantity or cost of particular kinds of errors. We have found, however,
that the patterns in the data do suggest that certain approaches, when applied in
the environment studied, will improve the development process. 

As an example, in the SEL environment neither external problems, such as 
requirements changes, nor global problems, such as interface design and
specification, were significant. Furthermore, the development environment was 
quite stable. Most problems were associated with the individual programmer. 
The data show that in the SEL environment it would clearly pay to impose more 
control on the process of composing individual routines. Since detecting and 
correcting most errors was apparently quite easy in the overwhelming majority 
of cases, more attention should be paid to preventing errors from entering the 
code initially. 


Conclusions Concerning Data Collection For Methodology Evaluation Purposes 


The data collection schema presented here has been applied to projects
in two different environments. We have been able to draw the following
conclusions as a result of designing and implementing the data collection 
processes. 




1. In all cases, it has been possible to collect data concurrently with the 
software development process in a software production environment. 

2. Data collection may be used to evaluate the application of a particular 
software development methodology, or simply to learn more about the 
software development process. In the former case, the better defined 
the methodology, the more precisely the goals of the data collection 
may be stated. 

3. The better controlled the development process, the more accurate and 
complete the data. 

4. For all projects studied, it has been necessary to validate the data, 
including interviews with the project developers. 

5. As patterns are discerned in the data collected, new questions of 
interest emerge. These questions may not be answerable with the 
available data, and may require establishing new goals and questions of 
interest. 


Motivations For Conducting Similar Studies 

The difficulties involved in conducting large scale controlled software 
engineering experiments have as yet prevented evaluations of software develop- 
ment methodologies in the environments where they are often claimed to work 
best. As a result, software engineers must depend on less formal techniques 
that can be used in real working environments to establish long-term trends. We 
view change analysis as one such technique and feel that more techniques, and 
many more results obtained by applying such techniques, are needed. 


Abstract 

The Software Engineering Laboratory has been monitoring software development at 
NASA Goddard Space Flight Center since 1976. This report describes the data collec- 
tion activities of the Laboratory and some of the difficulties of obtaining reliable data. 

In addition, the application of this data collection process to a current prototyping 
experiment is reviewed. 

I. INTRODUCTION 

There is a significant need to collect reliable data on software development projects in order 
to provide an empirical basis for making conclusions about software development methodologies, 
models and tools. However, such data is usually hard to collect and even harder to evaluate. 

Software is a multibillion dollar industry where 100% cost overruns are common, and mainte- 
nance activities can take up to 70% of the total cost of the system [11]. The availability of reli- 
able data to evaluate competing software development techniques is crucial. 

As Lord Kelvin stated, "I often say that when you can measure what you are speaking 
about, and express it in numbers, you can know something about it, but when you cannot express 
it in numbers, your knowledge is of a meager and unsatisfactory kind." The lack of adequate 
measures is certainly a problem in the software industry today. 

Many of the recent analyses of the software development process are based on data that is 
obtained from university experiments. Students often program special problems whose results are 
subjected to analysis. This gives the researcher the 10 to 100 data points necessary for statistical 
validity of the results. However, by virtue of being part of an academic program, such experi- 
ments are necessarily small and usually involve inexperienced programmers. There is a need to 
extend the scope of these experiments to a level appropriate to the multibillion dollar industry. 

Most software development data in industry has been collected after the fact. That is, a 
project is built and then a pile of documents is handed to a research group for evaluation. 
Often, critical information is missing and the results are not what one would expect. Rather than 
following the model of archeology - the study of dead software projects - software evaluation must 
follow the model of sociology - the study of living software societies. Data must be collected from 
ongoing projects, but the software sociologists must not impact the objects of their study. Given the need to 
finish projects on time and within budgets - a goal too often missed - it is difficult to justify 
spending money on data collection and evaluation activities. 

Specifically to address these problems, the Software Engineering Laboratory (SEL) was set 
up within NASA Goddard Space Flight Center in 1976. The goal was to study software develop- 
ment activities within NASA and report on experiences that will improve the process. This report 
describes the SEL and its experiences over the last six years. 

II. THE SOFTWARE ENGINEERING LABORATORY 

In 1976 the SEL was organized to study software development within the NASA environ- 
ment. More specifically, its primary charter was to monitor the development of ground support 
software for unmanned spacecraft. Each such system was typically 30,000 to 50,000 source lines of 
Fortran and took from 8 to 10 programmers up to two years to build. While this environment is 
not representative of all software development environments, SEL experiences are generalizable in 
some respects: 

a) Ground support software includes several program types such as data base functions, real 
time processing, scientific calculations and control language functions. The software is largely 
implemented in Fortran. 

b) By looking at a relatively narrow environment, data collected from many projects can be 
compared. Thus we get some of the benefits of a carefully controlled experiment without the 
expense of duplicating large developments. We do not have the problem of looking at a variety of 
projects, like compilers, COBOL programs, ground support software, MIS programs and then try- 
ing to say something consistent about all of these. 

To date, 46 projects have been studied, containing over 1.8 million lines of code. Over 150 
programmers participated in these projects, and the data base contains over 40 million bytes of 
data. The general SEL strategy is to carefully monitor a project and regularly collect data during 
its development. The data is then entered in the SEL data base for analysis. The purpose of this 
report is not to dwell on specific research results based on this data (see, for example, [8] for a 
collection of published papers about the SEL) but to describe the problems of collecting 
data, and what we have learned from this process. 

III. DATA COLLECTION 

III.1 MODEL GENERATION 

In order to fully take advantage of the available data, it must be known what information is 
desired. The models and measures that are to be investigated must be defined. A random data 
collection activity will usually miss relevant data, and then it will be too late to try and recover 
that information. 

In the SEL, two classes of measures were identified for study, and the data collection activi- 
ties were oriented around those areas. The initial activities included: 

a) Process Measures. Evaluating personnel and computer resources over time was a clear 
need. One activity was to try and validate models that others have identified (e.g., the Putnam 
Norden Rayleigh curve [1]) while another activity was to try and build new models to fit the 
empirical data (e.g., the Parr curve [7]). Once models were identified, their predictive nature was 
studied as a means of resource scheduling. 

The generation and correction of errors is another activity that has important economic 
consequences. However, few models are available to build upon, so there was a need to develop 
new models of errors and investigate their effects upon performance. 

b) Product Measures. The size, structure, and complexity of software are other important 
economic factors to consider. The evaluation of measures such as the software science measures 
of Halstead [5], the cyclomatic complexity of McCabe [6] and other measures developed within the 
SEL was another early goal. 

Reliability is a critical activity in most environments. In our particular environment, the 
software that was previously developed was highly reliable (typically under 10 errors in an opera- 
tional system), so that reliability, while important, was not a primary driving force in organizing 
the SEL. 

III.2 FORMS GENERATION 

The first process in evaluating empirical data is the data collection activity. Ideally, you 
would like the process to be automated and transparent to the programmer. However, this was 
not possible in this situation. We were interested in the human activities of software develop- 
ment. Thus we needed detailed information about how programmers spend their time. Because of 
this, a decision made early in the life of the SEL was that some data would be manually collected 
using a series of forms. 

There is a significant tradeoff consideration at this point. If we tried to collect too much 
information, programmers would object to the interference of the data collection activity on their 
work. If too little information was asked, then there would be little point in collecting it. 

We first developed an initial set of reporting forms. These have been revised several times 
since then. Each time certain fields were clarified and the amount of information sought decreased 
somewhat. At the present time, the effort required to fill out the forms is not significant. Initially 
seven forms were developed. However, only three are used heavily. These seven forms are: 

a) Resource Summary. This form lists the number of hours per week spent by all personnel 
on the project. This information is obtained mostly from the weekly time cards supplied by the 
contractor. This data is easy to obtain and causes little overhead to a project. However, it is 
very useful for monitoring global resource expenditures, especially in conjunction with the follow- 
ing Component Status Report. 


b) Component Status Report. This form is submitted weekly by each programmer. It lists 
for each component of the system (e.g., Fortran subroutine) the number of hours spent on each of 
nine categories (e.g., design, code, test, review, etc.). The detail required by this form initially 
caused some concern; however, in looking over past forms the average programmer worked on 
only 5 to 10 components per week and only 2 or 3 activities per component. Thus the overhead 
was not excessive. While the data is only approximate to the nearest hour, we believe that it is 
more accurate than many other data collection procedures. 

For example, many research papers give percentages for design, code, and test on a project. 
However, these are usually taken from resource summary data and calendar date milestones. If a 
design review occurs on a Friday, then all activities up until that date are design, with all activi- 
ties the next week being code. In the SEL environment, there was approximately a 25 percent 
error in using calendar dates for percent effort [4]. On four projects, approximately 25 percent of 
the design occurred during the coding phase, while almost half of the testing occurred prior to the 
testing phase (Figure 1). The Component Status Report is critical for a proper view of develop- 
ment activities. 

c) Change Report Form. This form is completed after each change to a component is com- 
pleted and tested. Due to the number of changes that a component undergoes during early 
development, there was no attempt to capture this data before the component was "complete" 
(i.e., through unit test). Note that we are capturing "changes" and not simply "errors." All 
modifications, due to errors or other considerations such as enhancements, are tracked. 

Besides identifying the type of change, this form also identifies the cause of the change - 
they are not always the same, although programmers have difficulty separating the two. The form 
also asks for information on the time to find and correct an error, and what tools and techniques 
were used in the process. 

In some environments, the introduction of this form might cause programmers to object; 
however, this was not the case in our environment. A standard change monitoring procedure was 
in place, so we simply changed the form that this branch of NASA GSFC was using before the 
SEL was created. 

These three forms provide the most important data collected by the SEL. Four other forms 
have been created and used with limited success. These are: 

d) Component Summary. This form identifies the characteristics of each component in a 
system. It gives the size, complexity and interfaces. The goal was to have this form filled out at 
least twice - once when the component was first identified during design, and again when it was 
completed. Our experience was that the initial form was filled out before much relevant informa- 
tion was known, and the data on the final form could be extracted automatically from the source 
code data base. 

e) Computer Run Analysis. An entry on this form is filled out for each computer run giving 
characteristics of the run (execution time, purpose of run, components processed) as well as 
whether the run met its objectives. This is one form that could be automated. However, the 
usual range of operating system "Completion Codes" is inadequate for many purposes. For exam- 
ple, a debugging run that was expected to fail at a certain statement, but ran to a successful exit, 
would have a satisfactory completion code, yet it was a failure as a run since the desired error did 
not occur. 

An interactive job submittal system could help. Before any run, the system could prompt for 
some of this information. After the run, the system could ask what happened. Since the current 
NASA environment consists primarily of interactive editing with batch processing, such an online 
process would have been difficult to implement. 

f) Programmer Analyst Survey. This form attempts to characterize the experiences of the 
programmers on the project in order to get a general profile of the project team. However, we 
immediately ran into confidentiality problems concerning personnel records. We never got the 
detailed information that we desired, but have obtained general comments on each programmer - 
although the goal is NOT to rate programmers. If there is any hint of any of this data being used 
for any sort of personnel action, then compliance drops sharply and the value of the data becomes 
open to question. 


g) General Project Summary. This is a form that provides a high-level description of a pro- 
ject. Since the software is developed by NASA and contractor personnel, the form is somewhat 
superfluous and the information is entered directly into the data base. 
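
The gap between calendar-based and activity-based effort figures noted under the Component Status Report can be illustrated with a small sketch. The record layout, component name, milestone date, and hours below are all hypothetical and are not SEL data; the sketch only shows that hours tagged by activity can yield a different design percentage than a split at the design review date.

```python
from collections import defaultdict
from datetime import date

# Hypothetical weekly Component Status Report entries:
# (week ending, component, activity, hours). Invented for illustration.
REPORTS = [
    (date(1982, 3, 5), "ORBIT1", "design", 25),
    (date(1982, 3, 5), "ORBIT1", "code", 5),
    (date(1982, 3, 12), "ORBIT1", "design", 15),
    (date(1982, 3, 12), "ORBIT1", "code", 25),
    (date(1982, 3, 19), "ORBIT1", "code", 10),
    (date(1982, 3, 19), "ORBIT1", "test", 20),
]

# Assumed milestone: the design review falls at the end of the first week.
DESIGN_REVIEW = date(1982, 3, 5)


def effort_by_activity(reports):
    """Percent of total hours per activity, as reported by programmers."""
    hours = defaultdict(int)
    for _week, _component, activity, h in reports:
        hours[activity] += h
    total = sum(hours.values())
    return {activity: 100.0 * h / total for activity, h in hours.items()}


def design_percent_by_calendar(reports, design_review):
    """Percent of hours counted as design when every hour up to the
    design review is treated as design and every later hour is not."""
    total = sum(h for _week, _component, _activity, h in reports)
    before = sum(h for week, _c, _a, h in reports if week <= design_review)
    return 100.0 * before / total


reported = effort_by_activity(REPORTS)
calendar_design = design_percent_by_calendar(REPORTS, DESIGN_REVIEW)
```

With these invented numbers, the activity-tagged reports attribute 40 percent of the effort to design, while the calendar split attributes only 30 percent, because design work continued after the review.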

An important consideration in forms development is consistency in collecting data. Along 
with each form a detailed instruction sheet was developed, as well as a glossary of relevant terms 
like "component," "line of code," and "life cycle phase." For example, we chose the name "com- 
ponent" rather than "subroutine" or "module" simply because those terms were well known (with 
alternative meanings) and we did not want to evoke any preconceived but wrong image in the 
minds of the participants. Even so, there was a great deal of confusion about the meanings of the 
various terms. During the early days of the SEL, many meetings were held to explain the process 
to programmers. Since each programmer worked about one year on a project, after six years there 
is a large core of personnel experienced in filling out our reporting forms. 

III.3 DATA PROCESSING 

After being filled out, each form is entered into a data base on a PDP 11/70 computer. In 
addition to the forms previously described, analyzers were run over the source programs to extract 
additional information, including lines of code and other measures such as the Halstead software 
science measures. 

Another step in forms processing is data validation. Someone must review the forms as they 
are submitted. This is expensive, but necessary. It is a quick way to catch and correct errors. In 
addition, the data entry program should check for data consistency and value ranges. For exam- 
ple, if the program is to read in input in the format MMDDYY, then a month input that is not a 
number in the range from 01 to 12 must be rejected. A field requiring an input of A, B, or C 
should reject any other value. Even though we manually check each form, a validation program 
was more effective for catching errors. 

All forms, especially the change report form, need to be reviewed by SEL personnel. Two 
common errors in the change report form are turning in one form that actually 
represents several errors, and submitting multiple forms for the same error. In earlier 
work, over half of the change report forms were modified following a careful study of each form. 
This is an expensive process, but needs to be done in order to have accurate data about your 
environment. 
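
The range and value checks described above for the data entry program might look like the following sketch. The field names, the dictionary layout, and the day-of-month check are assumptions made for illustration; the text specifies only the month range and the A, B, or C rule, and nothing here reproduces the actual SEL entry program.

```python
import re

# Hypothetical set of legal values for a coded form field.
ALLOWED_CODES = {"A", "B", "C"}


def valid_mmddyy(text):
    """True if text is six ASCII digits with month 01-12 and day 01-31.

    The day check is an added assumption; it does not account for
    month lengths or leap years.
    """
    if not re.fullmatch(r"[0-9]{6}", text):
        return False
    month = int(text[0:2])
    day = int(text[2:4])
    return 1 <= month <= 12 and 1 <= day <= 31


def rejected_fields(entry):
    """Return the names of the fields in one form entry that fail validation."""
    bad = []
    if not valid_mmddyy(entry.get("date", "")):
        bad.append("date")
    if entry.get("code") not in ALLOWED_CODES:
        bad.append("code")
    return bad
```

An entry such as `{"date": "133083", "code": "B"}` would be rejected on the date field alone, since 13 is not a valid month.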

Redundancy of data is another important consideration. Collecting the same or similar data 
on multiple forms allows for cross validation. There should be a reasonable correlation between 
the collected values. The resource summary and component status reports have been the easiest 
to validate. The Computer Run Analysis form is important for validating some of the change 
report data; however, limited availability of this form has handicapped some of this validation 
work. Because of that, it is important to manually check each change report form for selected 
projects. 
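
One way to picture the cross validation of redundant data is to compare weekly hours from the Resource Summary with the total hours reported on Component Status Reports for the same week. The sketch below is illustrative only: the dictionary layout and the 10 percent tolerance are assumptions, not SEL practice.

```python
def cross_validate(resource_hours, component_hours, tolerance=0.10):
    """Return the weeks in which the two sources disagree.

    resource_hours and component_hours map a week identifier to total
    hours. A week is flagged when the difference exceeds `tolerance`
    as a fraction of the larger of the two values, or when one source
    has hours and the other has none.
    """
    flagged = []
    for week in sorted(set(resource_hours) | set(component_hours)):
        r = resource_hours.get(week, 0.0)
        c = component_hours.get(week, 0.0)
        larger = max(r, c)
        if larger > 0 and abs(r - c) / larger > tolerance:
            flagged.append(week)
    return flagged


# Hypothetical weekly totals for one project.
from_time_cards = {1: 40.0, 2: 40.0, 3: 38.0}
from_status_reports = {1: 38.0, 2: 25.0, 4: 10.0}
weeks_to_review = cross_validate(from_time_cards, from_status_reports)
```

Flagged weeks are candidates for manual review; a mismatch says only that one of the two forms is likely wrong, not which one.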

IV. RESEARCH ACTIVITIES 

IV.1 PREVIOUS RESEARCH 

Research in the SEL has centered on resource and error models and on predicting software 
productivity. ([8] is a collection of relevant papers published over the last few years.) Perhaps the 
most important conclusion - although obvious in hindsight - which is relevant to this current dis- 
cussion is that there is no typical software development environment. 

All models include parameters - factors which represent variables in that environment. (Fig- 
ure 2 presents a list of factors from the SEL as well as two other studies [10] [3].) When models 
based on other environments are applied to the NASA environment, they invariably fail. Does 
that mean that NASA is different? unique? much better or much worse than other environments? 
For example, SEL programmers show much higher productivity in lines of code per week than 
programmers in other organizations. Does that mean that other organizations should pirate away NASA’s staff? 

Perhaps, but another explanation becomes apparent when NASA’s environment is studied in 
detail. In the SEL, most of the projects are similar ground support software systems. Thus the top 
level design for these projects is similar. Programmers are experts at this particular problem - 
thus high productivity. Many factors affecting requirements and design do not apply here. On the 
other hand, a contractor that bids on a variety of projects - an operating system, a compiler, a 
data base management system, an attitude orbit determination program, etc. - does not build an 
institutional knowledge about any one particular environment. Requirements and design factors 
now become significant in this environment and productivity drops. 

Each company operates in a different manner. Company policy as to working conditions, com- 
puter usage (batch or interactive), leave policy and salaries, management, support tools, etc. all 
affect productivity. Thus each organization (probably even separate divisions within a single 
organization) has a different structure and a different set of parameters. 

For this reason, one must first calibrate any model to be applied. First develop a quantita- 
tive relationship using many factors. Choose those factors relevant to your environment. Calibrate 
the equations based upon previous projects, and then use the calibrated model for prediction [2]. 
It is this important calibration step that is missing from most models. 

For example, if a baseline equation is given by: 

Effort = a * size + b 

then one can fit a and b from historical data; and the units of size can be determined from those 
relevant to your environment - such as lines of code, lines of source (including comments), number 
of modules, number of output statements, etc. Thus instead of a single model, there is a class of 
models tailored to each environment. 
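
As a minimal sketch of the calibration step, a and b in the baseline equation can be fitted to past projects by ordinary least squares. Least squares is one reasonable choice and is an assumption here; the text does not say which fitting method was used. The project sizes and efforts below are invented for illustration and are not SEL data.

```python
def calibrate(sizes, efforts):
    """Least-squares estimates of a and b in Effort = a * size + b."""
    n = len(sizes)
    if n != len(efforts) or n < 2:
        raise ValueError("need at least two (size, effort) pairs")
    mean_size = sum(sizes) / n
    mean_effort = sum(efforts) / n
    sxx = sum((s - mean_size) ** 2 for s in sizes)
    if sxx == 0:
        raise ValueError("sizes must not all be equal")
    sxy = sum((s - mean_size) * (e - mean_effort)
              for s, e in zip(sizes, efforts))
    a = sxy / sxx
    b = mean_effort - a * mean_size
    return a, b


def predict_effort(a, b, size):
    """Apply the calibrated baseline equation to a new project."""
    return a * size + b


# Hypothetical past projects: size in thousands of source lines,
# effort in staff-months.
past_sizes = [30, 35, 42, 50]
past_efforts = [65, 74, 90, 104]

a, b = calibrate(past_sizes, past_efforts)
estimate = predict_effort(a, b, 45)
```

Changing the unit of size (for example, from lines of code to number of modules) changes the fitted a and b, which is why the result is a class of models rather than a single one.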

IV.2 PROTOTYPES 

Over the past few years various methodologies have been studied by the SEL. A current 
SEL activity is the development of software prototypes. Currently software is designed, built and 
delivered. Rarely is the product evaluated in advance. However, the use of engineering prototypes 
in a preliminary evaluation is starting to be discussed by software engineering professionals [9]. 

While the term is appearing with increasing frequency, what does it really mean? Is it a 
quick and dirty throwaway implementation or a carefully designed subset of a final implementa- 
tion? What are the cost and reliability parameters for a prototype compared to a full implementa- 
tion? 

Currently data on the subject is meagre and usually based on small projects [12]. The SEL 
is now investigating a larger implementation with the same techniques as applied to previous SEL 
projects. 

Briefly, the target implementation is an integrated support system for flight dynamics 
research. Currently, experimenters (NASA scientists), in trying a new spacecraft model (e.g., a 
new orbit calculation) must understand the structure of the existing system, access the Fortran 
source modules, modify them, rebuild the operating program, test it, and then run the experiment 
- a complex and costly process. The new system is expected to "understand" several flight dynam- 
ics systems and to provide a higher level command language that guides the experimenter through 
the process of building a new version of a system, even if the experimenter is not thoroughly fami- 
liar with the existing system. This system is basically a command language interpreter with a 
complex data dictionary describing the underlying flight dynamics subsystems. 

This program is quite different from existing software produced by NASA, so the plan is to 
prototype it first. Two classes of data will be obtained from the prototype: 

a) Characteristics of the process. The Computer Science world has little information avail- 
able about prototyping, thus this data will add to the general knowledge about this process. What 
does the life cycle of a prototype look like? How much time is spent in design? code? test? Are 
errors crucial or can they be side-stepped in the prototype somewhat by "eliminating" the 
offending feature in the requirements? 

Similarly, how does prototyping affect the later full implementation? Will design be easier? 
Will productivity be higher? Will the overall cost of the system plus prototype be less than the 
cost of just the full system? Will reliability be higher or the interface more "user friendly"? 

b) Predictive nature of the prototype. Once a prototype is built, is it successful? How does 
one measure success? Will the full system be successful based upon an evaluation of the proto- 
type? A set of measures will be built into the prototype to provide some of these answers. 

A baseline study will be made of how experiments are conducted - the cost of machine and 
people resources will be measured. Some of these experiments will be repeated with the prototype 
to derive a cost. These costs will be used to predict the cost of using the full system. If acceptable, then 
that design will be used for the full implementation; if not, then the design will be modified to 
correct the problem in the full implementation. 

In addition, data will be collected on how often features are used in the prototype, and also 
how often the prototype is being circumvented in order to provide features that currently do not 
exist but are needed by the users. 

Once the final system is built, the predictive model can be validated in order to aid in 
developing a theory of software prototypes. 

V. CONCLUSIONS 

The Software Engineering Laboratory has been in existence for six years and has studied 
almost 40 projects. The empirical data that has been collected supports several conclusions: 

(1) Data collection is hard and expensive. It must be dynamically collected during the 
development of a project and not after completion. 

(2) Data must be validated. Error rates on manually filled out forms are high. A lack of 
standardized nomenclature for the field hurts consistency. Much effort must go into training person- 
nel to understand the data collection methodology. 

(3) Each software development environment is unique. Baseline equations must first be cali- 
brated with past projects before a model can be used in the future. 

(4) Little is known, but much is being said, about software prototypes. The SEL is 
currently studying this issue as part of its ongoing activities. 